Network Science Overview#
Network science is a field that studies the structure and behavior of complex networks. These networks can arise in various contexts, such as social interactions, biological systems, technological infrastructures, and more.
History#
Network science has evolved from the mathematical rigor of graph theory, a field that originated with Euler’s solution to the Seven Bridges of Königsberg problem in the 18th century. It grew significantly with the work of Watts and Strogatz on small-world networks in 1998 and Barabási and Albert on scale-free networks in 1999.
Major Concepts in Network Science#
Nodes and Edges#
Nodes (or vertices) represent individual entities within a network.
Edges (or links) signify the connections or interactions between these entities. These can be literal links (between nodes in the internet or between neurons in the brain) or abstract/metaphorical (friendship, acquaintanceship, collaboration) or based on similarity (most common in applications to language, where we might link words based on similarity in form or meaning).
Types of Networks#
Undirected vs. Directed: In undirected networks, edges have no orientation, while in directed networks, they do. An undirected edge implies communication in both directions.
Weighted vs. Unweighted: Edges in weighted networks have assigned weights, often indicating the strength or capacity of the connection. Think of edges in unweighted networks as all having a weight of 1.
Static vs. Dynamic: Static networks remain constant over time, whereas dynamic networks change.
Network Models#
Random Networks: Nodes are connected by edges randomly.
Fully-connected Networks: Every node is connected to every other node.
Scale-Free Networks: Include hubs with a large number of connections.
Small-World Networks: Feature short paths between nodes and high clustering.
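All four model types can be generated directly with NetworkX. A minimal sketch (the sizes and probabilities below are arbitrary illustrative choices, not values from the text):

```python
import networkx as nx

n = 100  # illustrative network size

# Random (Erdos-Renyi): each possible edge exists independently with probability p
random_g = nx.gnp_random_graph(n, p=0.05, seed=42)

# Fully-connected: every node linked to every other node
complete_g = nx.complete_graph(n)

# Scale-free (Barabasi-Albert): new nodes preferentially attach to hubs
scale_free_g = nx.barabasi_albert_graph(n, m=2, seed=42)

# Small-world (Watts-Strogatz): ring lattice with a fraction p of edges rewired
small_world_g = nx.watts_strogatz_graph(n, k=4, p=0.1, seed=42)

# A fully-connected graph on n nodes has n*(n-1)/2 edges
print(complete_g.number_of_edges())  # → 4950
```
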
Small-World Networks vs. Scale-Free Networks#
Small-world networks and scale-free networks are distinct types of networks with different defining properties. Although many real-world networks exhibit both sets of characteristics, the two concepts are defined by different features.
Small-World Networks#
Small-world networks are characterized by:
High Clustering: Nodes tend to create tightly knit groups characterized by a high density of connections.
Short Average Path Lengths: The path between any two nodes is relatively short, reflecting the “six degrees of separation” concept.
These networks are named for the way average path lengths stay short even as the network grows large.
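Both properties are easy to check on a synthetic Watts–Strogatz graph (the parameter values here are illustrative):

```python
import networkx as nx

# Ring lattice (high clustering) with a small fraction p of edges rewired,
# creating shortcuts that keep average path lengths short
G = nx.connected_watts_strogatz_graph(n=1000, k=10, p=0.05, seed=1)

C = nx.average_clustering(G)            # stays high (ring-lattice structure)
L = nx.average_shortest_path_length(G)  # stays short (rewired shortcuts)
print(f"clustering: {C:.3f}, average path length: {L:.2f}")
```
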
Scale-Free Networks#
Scale-free networks are defined by a degree distribution that follows a power law:
Presence of Hubs: A few nodes have a very high degree, acting as central connection points.
Robustness and Vulnerability: The network is generally robust to random node failures but vulnerable to the targeted removal of hubs.
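The hub structure shows up clearly in a preferential-attachment graph; a sketch with illustrative parameters:

```python
import networkx as nx

# Preferential attachment: each new node adds m edges, favoring high-degree
# targets, which yields a heavy-tailed (power-law-like) degree distribution
G = nx.barabasi_albert_graph(n=2000, m=2, seed=7)

degrees = sorted((d for _, d in G.degree()), reverse=True)
print("largest degrees (hubs):", degrees[:5])   # a few very large hubs
print("median degree:", degrees[len(degrees) // 2])  # most nodes stay near m
```
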
Distinctions and Misconceptions#
Small-world networks are not necessarily a subset of scale-free networks, nor are scale-free networks always small-world. The key distinction lies in the degree distribution and clustering properties.
Clustering vs. Degree Distribution: A small-world network has high clustering but does not require a power-law degree distribution. Conversely, a scale-free network has a power-law degree distribution but does not necessarily have high clustering.
Importance of Random Networks#
Random networks are a cornerstone of network theory, providing insights into how complex network behavior and structures emerge from stochastic processes. The logic is similar to that in statistical inference, where we look at a pattern of data and ask how unlikely that pattern is by comparing it to patterns that emerge when we take (or simulate taking) random samples. Random networks are fundamental in network science for several key reasons:
Benchmarking and Theoretical Understanding: Random networks act as a baseline, or null model, allowing researchers to detect non-random structure in real-world networks. This comparison can uncover the underlying mechanisms that drive network formation. A typical approach is to generate a network with approximately the same number of nodes and edges as the ‘real’ network you want to compare to. The random network (or better, multiple random networks) provides a baseline: you can assess in what ways your ‘real’ network differs substantially from random expectation.
Understanding of Phase Transitions: Studying random networks has provided insights into the critical thresholds at which networks undergo structural changes, such as the formation of a giant component (e.g., the proportion of possible connections that must exist before one large connected component emerges). These insights are analogous to phase transitions in physics.
Basis for More Complex Models: They are the foundation upon which more complex network models are built. By introducing additional rules to the formation of random networks, scientists can model small-world and scale-free networks. We can discuss this in class.
Modeling Random Processes: For systems where interactions are truly stochastic, random networks provide an appropriate model for understanding the system’s behavior, such as in some communication and biological networks.
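The null-model logic above can be sketched in a few lines: generate several G(n, m) random graphs matched to an observed network and compare a statistic of interest. Here a synthetic small-world graph stands in for the ‘real’ network, and all parameter values are illustrative:

```python
import networkx as nx

# Stand-in for an observed "real" network
observed = nx.connected_watts_strogatz_graph(n=500, k=8, p=0.05, seed=3)
n, m = observed.number_of_nodes(), observed.number_of_edges()

# Null model: several random graphs with matched node and edge counts
random_Cs = [nx.average_clustering(nx.gnm_random_graph(n, m, seed=s))
             for s in range(10)]

observed_C = nx.average_clustering(observed)
baseline_C = sum(random_Cs) / len(random_Cs)

# A much higher observed clustering suggests non-random structure
print(f"observed C = {observed_C:.3f} vs. random baseline C = {baseline_C:.3f}")
```
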
Importance of Fully Connected Networks#
Fully connected networks, while not common in real-world scenarios, offer an important baseline for understanding the potential and limitations of network connectivity. The path length between any two nodes is 1 because every node is connected to every other node.
Idealized Connectivity: These networks represent the state of maximum possible connectivity, providing a contrast to the sparser real-world networks.
Maximum Redundancy: They exhibit maximum path redundancy, which can inform studies on network resilience and robustness against failures.
Benchmark for Network Capacity: As a benchmark for network capacity, they highlight the limits of network traffic in communication systems.
Theoretical Extreme: They embody a theoretical extreme of network density, useful for exploring network properties under such conditions.
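These extremes are easy to verify on a small complete graph:

```python
import networkx as nx

K = nx.complete_graph(20)  # every pair of 20 nodes is connected

# Density, clustering, and average path length all sit at their maximum
print(nx.density(K))                       # → 1.0
print(nx.average_clustering(K))            # → 1.0
print(nx.average_shortest_path_length(K))  # → 1.0
```
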
Network Topology#
Degree Distribution: Reflects how connections are distributed among the nodes in a network.
Clustering Coefficient (\( C \)): A measure of the degree to which nodes in a network tend to cluster together.
Path Length: The average shortest path length between pairs of nodes in a network.
Components: In network science, the structure and connectivity of a network are often analyzed in terms of its components. A component in a network is a subset of the network where there is a path between any two nodes within the subset. Components are isolated from each other, meaning there are no edges between nodes in different components.
In mathematical terms, a component of a graph is a maximal connected subgraph. Every node in a component is reachable from every other node in the same component through some path.
Giant Component: The giant component is the largest component in a network and typically contains a significant majority of all nodes. In many real-world networks, the giant component includes a large fraction of the entire network, highlighting a kind of “small-world” effect.
The emergence of a giant component is often described in the context of percolation theory and phase transitions. For example, in random networks, there is a critical threshold of connectivity above which a giant component forms.
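For G(n, p) random graphs the critical threshold sits at mean degree np = 1. A sketch of the transition (graph size and probabilities are illustrative):

```python
import networkx as nx

n = 2000

def giant_fraction(p, seed=0):
    """Fraction of nodes in the largest component of a G(n, p) graph."""
    G = nx.gnp_random_graph(n, p, seed=seed)
    return max(len(c) for c in nx.connected_components(G)) / n

# Below the threshold (mean degree < 1) components stay small;
# above it (mean degree > 1) a giant component emerges
print(giant_fraction(0.5 / n))  # subcritical: largest component is tiny
print(giant_fraction(2.0 / n))  # supercritical: giant component holds most nodes
```
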
Community and Modularity#
A community in network science refers to a group of nodes that are more densely connected to each other than to nodes outside the group. A community coincides with a component only in the extreme case where it has no connections to the rest of the network; more typically, communities are densely connected internally but still sparsely linked to other communities.
Modularity is a measure that quantifies the strength of division of a network into communities (or modules). High modularity indicates a network structure with dense connections between nodes within communities but sparse connections between nodes in different communities.
Relation to the giant component. The concept of a giant component is related to community structure in the sense that a giant component may encompass several communities within it. The giant component’s robustness and connectivity can often be a result of the interplay between these tightly-knit communities.
Understanding components, and particularly the giant component, is crucial for analyzing the robustness and functionality of networks, as well as for understanding how processes such as the spread of information or diseases occur within a network.
Network Science Statistics#
Microscale (Node-Level)#
Degree (\( k \)): Assesses the number of connections a node has. It’s important because nodes with higher degrees can have a larger influence or be more integral to the network’s robustness.
Betweenness Centrality (\( c_B \)): Quantifies how often a node appears on the shortest paths between other nodes. This is important for identifying nodes that serve as critical bridges within the network.
Closeness Centrality (\( c_C \)): Measures how close a node is to all other nodes in the network, indicating how quickly it can interact with all others.
Eigenvector Centrality: Eigenvector centrality assigns relative scores to all nodes in the network based on the concept that connections to high-scoring nodes contribute more to the score of the node in question. Here, the centrality of a node is proportional to the sum of the centralities of its neighbors. This measure is used to estimate the influence of a node in a network. It can be particularly insightful in networks where not all connections are equally valuable.
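These node-level measures can be computed on a toy graph. In this sketch two triangles are joined by a single edge, so the joining nodes act as bridges:

```python
import networkx as nx

# Two triangles (a,b,c) and (d,e,f) joined by the bridge edge b-d
G = nx.Graph([('a', 'b'), ('a', 'c'), ('b', 'c'),
              ('b', 'd'),
              ('d', 'e'), ('d', 'f'), ('e', 'f')])

betweenness = nx.betweenness_centrality(G)
closeness = nx.closeness_centrality(G)
eigenvector = nx.eigenvector_centrality(G)

# Bridge node 'b' lies on every shortest path between the two triangles;
# peripheral node 'a' lies on none, so its betweenness is 0
print(betweenness['b'], betweenness['a'])
```
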
Mesoscale#
Community Structure: The extent to which a network divides naturally into tightly knit groups. This is important for understanding the modularity of the network and can reveal functional clusters.
Motifs: Specific subgraphs that recur at a higher frequency than in randomized networks. They are important because they may represent fundamental building blocks of complex networks.
Macroscale#
Diameter (\( D \)): The longest of all the shortest paths in a network. It indicates the network’s ‘size’ in terms of the distance between its most distant nodes.
Density (\( \delta \)): The proportion of potential connections in a network that are actual connections, reflecting how closely knit the network is.
Modularity (\( Q \)): The strength of division of a network into modules (also known as communities). A network with high modularity has dense connections between the nodes within modules but sparse connections between nodes in different modules.
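The three macroscale measures, computed on Zachary's karate club network, a standard 34-node example bundled with NetworkX:

```python
import networkx as nx
from networkx.algorithms import community

G = nx.karate_club_graph()  # classic 34-node, 78-edge social network

print("diameter:", nx.diameter(G))
print("density:", round(nx.density(G), 3))

communities = community.greedy_modularity_communities(G)
print("modularity:", round(community.modularity(G, communities), 3))
```
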
Understanding these metrics helps us comprehend the functionality, efficiency, and robustness of networks, which can be crucial for fields ranging from epidemiology to organizational behavior.
Examples of Complex Systems and Their Network Metrics#
Microscale#
High Degree: Social media influencers or hub airports in transportation networks.
Low Degree: Isolated individuals or small local airports.
High Betweenness Centrality: Central servers in the Internet or key transit stations in urban transportation.
Low Betweenness Centrality: Peripheral nodes with minimal control over the network’s flow.
High Closeness Centrality: Well-connected individuals in a tight-knit community.
Low Closeness Centrality: Remote nodes in a large-scale network, like distant towns.
Mesoscale (medium – substructures within the network)#
High Community Structure: Ethnic enclaves in a city or departments within a university.
Low Community Structure: Homogenized populations or interdisciplinary research groups.
Frequent Motifs: Neural networks with specific signal processing patterns.
Infrequent Motifs: Newly formed social networks without established interaction patterns.
Macroscale (whole-system, the entire network or graph)#
Large Diameter: The Internet, as it spans a large geographical area.
Small Diameter: A small office network, where all computers are closely connected.
High Density: A group of close friends, where everyone knows each other.
Low Density: A professional network with many acquaintances but few close relationships.
High Modularity: Biological systems, such as cells with different functional compartments.
Low Modularity: A start-up company with a flat and interconnected organizational structure.
Some more key concepts#
Assortativity#
Assortativity in a network refers to the preference for a network’s nodes to attach to others that are similar in some way. In social networks, this might mean that people often form connections with others who are similar in age, socioeconomic status, or education level.
Assortativity Coefficient: A normalized measure that quantifies the level of assortativity in a network, ranging from -1 (disassortative) to 1 (assortative).
Importance: High assortativity may indicate robustness against targeted attacks but can also suggest vulnerability to systemic failures or rapid spread of information (or misinformation).
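Two quick checks of the coefficient's range, using graphs where the sign is easy to reason about:

```python
import networkx as nx

# A star is maximally disassortative: the high-degree hub connects only to
# degree-1 leaves, so the coefficient is -1 (up to floating point)
star = nx.star_graph(10)
print(nx.degree_assortativity_coefficient(star))

# The karate club network is also disassortative: hubs link to low-degree members
print(round(nx.degree_assortativity_coefficient(nx.karate_club_graph()), 3))
```
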
Transitivity#
Transitivity measures the likelihood that the adjacent nodes of a node are connected to each other, providing insight into the clustering within the network.
Global Clustering Coefficient: The ratio of the number of closed triplets (or triangles) to the total number of triplets (both open and closed).
Significance: A high global clustering coefficient suggests a tightly knit community structure, which can affect the network’s dynamics, such as the resilience to the spread of diseases.
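A worked example: a triangle with one dangling edge has one triangle and five connected triplets overall (each triangle is counted three times in the numerator), giving transitivity 3·1/5 = 0.6:

```python
import networkx as nx

# Triangle 0-1-2 plus a dangling edge 2-3
G = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3)])

# transitivity = 3 * (number of triangles) / (number of connected triplets)
print(nx.transitivity(G))  # → 0.6
```
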
Edge Density#
Edge density reflects how close a network is to being a complete (fully connected) graph, where every possible edge is present.
Density Formula: \( \delta = \frac{2|E|}{|V|(|V|-1)} \), where \( |E| \) is the number of edges and \( |V| \) is the number of nodes.
Utility: Density can reveal the overall connectivity of the network, the potential for redundancy in connections, and the efficiency of information or resource transfer.
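The formula can be checked against NetworkX's built-in `nx.density` on any graph:

```python
import networkx as nx

G = nx.gnm_random_graph(50, 100, seed=0)  # 50 nodes, exactly 100 edges

E = G.number_of_edges()
V = G.number_of_nodes()
manual_density = 2 * E / (V * (V - 1))  # the formula above

print(manual_density, nx.density(G))  # the two values agree
```
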
These concepts, among others, are tools that can dissect the complex anatomy of networks, revealing underlying patterns and properties that might not be apparent at first glance.
A concrete example using word form similarity#
Let’s read in orthographic (spelling) and phonological (sound) forms of words from the LexFindR package in R. I’ve saved the lemmalex lexicon from LexFindR and added it to our directory in the repository.
The first step is to import the packages we'll need and then read in the lexicon.
import networkx as nx
import time
from collections import defaultdict
import pandas as pd
import numpy as np
def substitution_neighbors(word1, word2):
    """Check if two words differ by exactly one letter substitution."""
    # since we do not allow deletions or additions, we can skip word pairs
    # that are different lengths to speed things up
    if len(word1) != len(word2):
        return False
    # cool python trick to compare strings position by position
    differences = sum(1 for a, b in zip(word1, word2) if a != b)
    # only returns True if the difference is exactly 1 letter
    return differences == 1
# 17,750 words from the LexFindR R package
file_path = "./lemmalex.csv"
# Read the CSV file
alldata = pd.read_csv(file_path, na_filter=False)
# Ensure all data in 'Item' is treated as string
alldata['Item'] = alldata['Item'].astype(str)
# Create a new field 'Letters' which is the length of the 'Item' string
alldata['Letters'] = alldata['Item'].str.len() # Using .str.len() ensures that the data is treated as a string
print(len(alldata))
# Display the first few rows of the DataFrame to confirm the new 'Letters' column
# Note that we could use:
# print(alldata.head())
# and that would be more 'generic' and would work in a vanilla python environment,
# but just giving the next line prints a prettier version in jupyter, with the header
# row in bold and separated by a line and every other row shaded grey
alldata.head()
17750
| | Item | Frequency | Pronunciation | Letters |
|---|---|---|---|---|
| 0 | a | 20415.27 | AH | 1 |
| 1 | abandon | 8.10 | AH B AE N D IH N | 7 |
| 2 | abandonment | 0.96 | AH B AE N D AH N M AH N T | 11 |
| 3 | abate | 0.10 | AH B EY T | 5 |
| 4 | abbey | 3.18 | AE B IY | 5 |
Now let’s reduce it to words of a single length – 4 letters. We are going to create a graph later on just using substitution neighbors. If we do this for the full lexicon, there will be distinct subgraphs for each length that won’t be connected to each other. So we will reduce to one length, which will also give us a much smaller set of words to work with, which will allow us to do some interesting things fairly quickly.
# Given our neighbor definition, 4-letter words cannot be neighbors to words of
# different lengths. If we used this for our full lexicon, we would get unconnected
# subgraphs for each letter length. To keep things speedy, let's just use length 4
# Filter the dataframe to only have items with 'Letters' > 3 and < 5 -- length 4
data = alldata[(alldata['Letters'] > 3) & (alldata['Letters'] < 5)]
#data=alldata
# Display the column names and the first few rows of the DataFrame to confirm the correct structure
print(len(data))
print(data.head()) # code block only does the last non-print call, so let's print the head
data.tail()
1417
Item Frequency Pronunciation Letters
13 abet 0.10 AH B EH T 4
23 able 159.90 EY B AH L 4
131 ache 2.49 EY K 4
134 acid 9.96 AE S IH D 4
152 acre 1.82 EY K AH R 4
| | Item | Frequency | Pronunciation | Letters |
|---|---|---|---|---|
| 17735 | zero | 21.45 | Z IH R OW | 4 |
| 17736 | zest | 0.69 | Z EH S T | 4 |
| 17738 | zinc | 0.84 | Z IH NG K | 4 |
| 17744 | zone | 20.12 | Z OW N | 4 |
| 17749 | zoom | 3.55 | Z UW M | 4 |
Creating the network / graph#
Next block uses the NetworkX package to create a substitution network from our 4-letter words based on spelling.
print(f'Starting network creation with {str(len(data))} entries from {file_path}')
# Create an empty undirected graph
G = nx.Graph()
# Add words from the 'Item' column as nodes to the graph
G.add_nodes_from(data['Item'])
# Create a dictionary to group words by their length
words_by_length = {}
excluded_items_count = 0
for idx, word in enumerate(data['Item']):
    if pd.notnull(word):  # this will check if the word is not NaN or None
        str_word = str(word)
        words_by_length.setdefault(len(str_word), []).append(str_word)
    else:
        # Print the entire row that contains the bad data
        print(f"Excluded row at index {idx}:\n{data.iloc[idx]}")
        excluded_items_count += 1
# Add edges between words of the same length that differ by only one letter
start_time = time.time()
processed_items_count = 0
for length, words in words_by_length.items():
    for i, word1 in enumerate(words):
        for word2 in words[i+1:]:  # Only compare each pair once
            if substitution_neighbors(word1, word2):
                G.add_edge(word1, word2)
        processed_items_count += 1
        if processed_items_count % 250 == 0:
            elapsed_time = time.time() - start_time
            seconds_per_word = elapsed_time / processed_items_count
            words_left = len(data['Item']) - processed_items_count
            seconds_to_go = words_left * seconds_per_word
            print(f"{processed_items_count} items processed. Average time per word: {seconds_per_word:.4f} seconds. Estimated time to go: {seconds_to_go:.2f} seconds.")
total_time = time.time() - start_time
print (f'---> Created network in {total_time:.2f} seconds')
print(f'Excluded {excluded_items_count} items due to NaN or None')
# Display basic information about the updated graph
num_nodes = G.number_of_nodes()
num_edges = G.number_of_edges()
print(f'{num_nodes} nodes with {num_edges} edges')
Starting network creation with 1417 entries from ./lemmalex.csv
250 items processed. Average time per word: 0.0007 seconds. Estimated time to go: 0.80 seconds.
500 items processed. Average time per word: 0.0006 seconds. Estimated time to go: 0.56 seconds.
750 items processed. Average time per word: 0.0005 seconds. Estimated time to go: 0.36 seconds.
1000 items processed. Average time per word: 0.0005 seconds. Estimated time to go: 0.19 seconds.
1250 items processed. Average time per word: 0.0004 seconds. Estimated time to go: 0.07 seconds.
---> Created network in 0.51 seconds
Excluded 0 items due to NaN or None
1417 nodes with 4460 edges
Next let’s get some basic network properties#
This will print the average clustering coefficient and the average shortest path between node pairs. It takes several seconds to run.
# Calculate the average clustering coefficient for the graph
avg_clustering_coefficient = nx.average_clustering(G)
# Since the graph is not fully connected, we find the largest connected component for the average shortest path length
largest_cc = max(nx.connected_components(G), key=len)
subgraph = G.subgraph(largest_cc)
# Calculate the average shortest path length of the largest connected component
try:
    avg_shortest_path_length = nx.average_shortest_path_length(subgraph)
except nx.NetworkXError as e:
    avg_shortest_path_length = str(e)  # If it's not connected, we report the error message instead
avg_clustering_coefficient, avg_shortest_path_length
(0.3442593256015412, 5.828235925603215)
Let’s compare to a random baseline#
Let’s create a random network with similar numbers of nodes and edges and see what its average clustering coefficient and shortest path length values are.
import random
# Create a random graph with the same number of nodes and approximately the same number of edges
random_graph = nx.gnm_random_graph(num_nodes, num_edges)
# Calculate the average clustering coefficient for the random graph
random_avg_clustering_coefficient = nx.average_clustering(random_graph)
# Calculate the average shortest path length of the largest connected component in the random graph
random_largest_cc = max(nx.connected_components(random_graph), key=len)
random_subgraph = random_graph.subgraph(random_largest_cc)
# Calculate the average shortest path length of the largest connected component
try:
    random_avg_shortest_path_length = nx.average_shortest_path_length(random_subgraph)
except nx.NetworkXError as e:
    random_avg_shortest_path_length = str(e)  # If it's not connected, we report the error message instead
(random_avg_clustering_coefficient, random_avg_shortest_path_length)
(0.004199846260537863, 4.14665160609953)
Interestingly, the average clustering coefficient is much lower in the random graph, but the mean shortest path lengths are similar. What can we infer from this…?
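One common way to formalize this comparison is the small-world coefficient σ = (C/C_rand) / (L/L_rand), which is well above 1 for small-world graphs. A sketch computed on a synthetic graph rather than the lexicon itself (all parameter values are illustrative):

```python
import networkx as nx

# Observed graph stand-in and a size-matched random baseline
G = nx.connected_watts_strogatz_graph(500, 8, 0.05, seed=2)
R = nx.gnm_random_graph(G.number_of_nodes(), G.number_of_edges(), seed=2)

C, C_rand = nx.average_clustering(G), nx.average_clustering(R)
L = nx.average_shortest_path_length(G)

# Use the giant component of the random graph in case it is disconnected
R_giant = R.subgraph(max(nx.connected_components(R), key=len))
L_rand = nx.average_shortest_path_length(R_giant)

sigma = (C / C_rand) / (L / L_rand)
print(f"small-world coefficient sigma = {sigma:.1f}")  # >> 1: small-world
```
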
Let’s examine more metrics#
Let’s make a simple function that provides a graph analysis.
import networkx as nx
from networkx.algorithms import community
def analyze_graph(G):
    start_time = time.time()
    print('# Starting graph analysis...')
    # Basic details
    number_of_nodes = G.number_of_nodes()
    number_of_edges = G.number_of_edges()
    total_possible_connections = number_of_nodes * (number_of_nodes - 1) // 2
    # Degree distribution calculations
    degrees = [deg for node, deg in G.degree()]
    mean_degree = sum(degrees) / number_of_nodes
    proportion_zero_connections = degrees.count(0) / number_of_nodes
    # Connected components calculations
    connected_components = list(nx.connected_components(G))
    num_connected_components = len(connected_components)
    num_unconnected_components = number_of_nodes - num_connected_components
    proportion_of_total_possible = num_connected_components / number_of_nodes
    # Diameter of the largest connected component
    largest_cc = max(connected_components, key=len)
    subgraph = G.subgraph(largest_cc)
    if nx.is_connected(subgraph):
        diameter = nx.diameter(subgraph)
    else:
        diameter = 'Graph is not fully connected'
    # Other metrics
    betweenness_centrality = nx.betweenness_centrality(subgraph)
    closeness_centrality = nx.closeness_centrality(subgraph)
    eigenvector_centrality = nx.eigenvector_centrality(subgraph, max_iter=1000)
    assortativity = nx.degree_assortativity_coefficient(G)
    communities = community.greedy_modularity_communities(subgraph)
    modularity = community.modularity(subgraph, communities)
    edge_density = nx.density(G)
    transitivity = nx.transitivity(G)
    # Compile the results into a dictionary
    graph_metrics = {
        'Number of Nodes': number_of_nodes,
        'Number of Edges': number_of_edges,
        'Total Possible Connections': total_possible_connections,
        'Mean Degree': mean_degree,
        'Proportion of Isolated Nodes': proportion_zero_connections,
        'Number of Connected Components': num_connected_components,
        'Proportion of Total Possible Components': proportion_of_total_possible,
        'Number of Unconnected Components': num_unconnected_components,
        'Diameter of Largest Component': diameter,
        'Betweenness Centrality': betweenness_centrality,
        'Closeness Centrality': closeness_centrality,
        'Eigenvector Centrality': eigenvector_centrality,
        'Assortativity': assortativity,
        'Modularity': modularity,
        'Edge Density': edge_density,
        'Transitivity': transitivity,
    }
    # Build the report string
    report = f"""
Network Analysis Report
-----------------------
Basic Network Details:
Number of Nodes: {number_of_nodes}
Number of Edges: {number_of_edges}
Total Possible Connections in a Fully-Connected Graph: {total_possible_connections}
1. Mean Degree:
The mean degree is the average number of connections per node.
Mean Degree: {mean_degree}
2. Proportion of Isolated Nodes:
This is the proportion of nodes with no connections.
Proportion of Isolated Nodes: {proportion_zero_connections}
3. Number of Connected Components:
This is the number of connected subgraphs where each node is reachable from every other node in the subgraph.
Number of Connected Components: {num_connected_components}
Total Possible Number of Components: {number_of_nodes}
Proportion of Total Possible Components: {proportion_of_total_possible:.4f}
4. Number of Unconnected Components:
These are components with no connection to other components.
Number of Unconnected Components: {num_unconnected_components}
5. Diameter of Largest Component:
The diameter is the longest shortest path between any two nodes in the network.
Diameter of Largest Component: {graph_metrics['Diameter of Largest Component']}
6. Betweenness Centrality:
Betweenness centrality measures the extent to which a node lies on paths between other nodes.
The betweenness centrality values are calculated.
7. Closeness Centrality:
Closeness centrality measures how close a node is to all other nodes in the network.
The closeness centrality values are calculated.
8. Eigenvector Centrality:
Eigenvector centrality measures a node's influence based on the influence of its neighbors.
The eigenvector centrality values are calculated.
9. Assortativity:
Assortativity measures the similarity of connections in the graph with respect to the node degree.
The assortativity coefficient of the network is: {assortativity}
10. Modularity:
Modularity measures the strength of division of a network into modules (also called communities).
The modularity of the network is: {modularity}
11. Edge Density:
Edge density measures the ratio of the number of edges to the number of possible edges in the network.
The edge density of the network is: {edge_density}
12. Transitivity:
Transitivity is the overall level of 'cliquishness' of the network.
The transitivity of the network is: {transitivity}
"""
    print(f'# Generated graph report in {time.time()-start_time} seconds.')
    return graph_metrics, report
# Example usage:
# Assume 'G' is a NetworkX graph object.
metrics, report = analyze_graph(G)
print(report)
# Starting graph analysis...
# Generated graph report in 14.967169046401978 seconds.
Network Analysis Report
-----------------------
Basic Network Details:
Number of Nodes: 1417
Number of Edges: 4460
Total Possible Connections in a Fully-Connected Graph: 1003236
1. Mean Degree:
The mean degree is the average number of connections per node.
Mean Degree: 6.294989414255469
2. Proportion of Isolated Nodes:
This is the proportion of nodes with no connections.
Proportion of Isolated Nodes: 0.05434015525758645
3. Number of Connected Components:
This is the number of connected subgraphs where each node is reachable from every other node in the subgraph.
Number of Connected Components: 99
Total Possible Number of Components: 1417
Proportion of Total Possible Components: 0.0699
4. Number of Unconnected Components:
These are components with no connection to other components.
Number of Unconnected Components: 1318
5. Diameter of Largest Component:
The diameter is the longest shortest path between any two nodes in the network.
Diameter of Largest Component: 15
6. Betweenness Centrality:
Betweenness centrality measures the extent to which a node lies on paths between other nodes.
The betweenness centrality values are calculated.
7. Closeness Centrality:
Closeness centrality measures how close a node is to all other nodes in the network.
The closeness centrality values are calculated.
8. Eigenvector Centrality:
Eigenvector centrality measures a node's influence based on the influence of its neighbors.
The eigenvector centrality values are calculated.
9. Assortativity:
Assortativity measures the similarity of connections in the graph with respect to the node degree.
The assortativity coefficient of the network is: 0.5352440464719707
10. Modularity:
Modularity measures the strength of division of a network into modules (also called communities).
The modularity of the network is: 0.7130377843781806
11. Edge Density:
Edge density measures the ratio of the number of edges to the number of possible edges in the network.
The edge density of the network is: 0.004445613993118269
12. Transitivity:
Transitivity is the overall level of 'cliquishness' of the network.
The transitivity of the network is: 0.3850213621512943
Improving the graph report#
That’s nice, but kind of wordy. Let’s make a dataframe instead. This could have advantages for comparing analyses of two graphs, for example. This function is set up to generate a single random graph with node and edge counts similar to the graph we send it, allowing us to compare our graph’s characteristics to the random baseline.
import networkx as nx
import pandas as pd
from networkx.algorithms import community
import numpy as np
import time
def analyze_graph(G, compare_random=True, max_iter=10000, return_node_df=False):
    start_time = time.time()
    print(f'# Starting graph analysis')
    # Helper function to calculate the statistical summary for a list of values
    def stats_summary(values):
        return [np.round(metric(values), 4) for metric in (np.mean, np.median, np.std, np.min, np.max)]
    # Helper function to calculate metrics
    def calculate_metrics(graph):
        # Basic metrics
        metrics = {
            'Nodes': graph.number_of_nodes(),
            'Edges': graph.number_of_edges(),
            'Isolated Nodes': nx.number_of_isolates(graph),
            'Edge Density': nx.density(graph),
            'Transitivity': nx.transitivity(graph),
            'Assortativity': nx.degree_assortativity_coefficient(graph),
        }
        # Calculate statistical summaries
        degrees = list(dict(graph.degree()).values())
        metrics['Degree'] = stats_summary(degrees)
        clustering = list(nx.clustering(graph).values())
        metrics['Clustering Coefficient'] = stats_summary(clustering)
        # Initialize shortest path and diameter metrics
        metrics['Shortest Path'] = 'Graph not connected'
        metrics['Diameter of Giant Component'] = 'Undefined'
        if nx.is_connected(graph):
            sp_lengths = dict(nx.all_pairs_shortest_path_length(graph))
            sp_length_values = [length for lengths in sp_lengths.values() for length in lengths.values()]
            metrics['Shortest Path'] = stats_summary(sp_length_values)
            metrics['Diameter of Giant Component'] = nx.diameter(graph)
        # Centrality measures
        metrics['Betweenness Centrality'] = stats_summary(list(nx.betweenness_centrality(graph).values()))
        metrics['Closeness Centrality'] = stats_summary(list(nx.closeness_centrality(graph).values()))
        metrics['Eigenvector Centrality'] = stats_summary(list(nx.eigenvector_centrality_numpy(graph, max_iter=max_iter).values()))
        # Modularity for the full graph if connected
        if nx.is_connected(graph):
            communities = community.greedy_modularity_communities(graph)
            metrics['Modularity (Full Graph)'] = community.modularity(graph, communities)
        else:
            metrics['Modularity (Full Graph)'] = 'Undefined'
        # Modularity and shortest paths for the giant component
        giant_component = graph.subgraph(max(nx.connected_components(graph), key=len))
        communities_gc = community.greedy_modularity_communities(giant_component)
        metrics['Modularity (Giant Component)'] = community.modularity(giant_component, communities_gc)
        if nx.is_connected(giant_component):
            sp_lengths_gc = dict(nx.all_pairs_shortest_path_length(giant_component))
            sp_length_values_gc = [length for lengths in sp_lengths_gc.values() for length in lengths.values()]
            metrics['Shortest Path (Giant Component)'] = stats_summary(sp_length_values_gc)
        else:
            metrics['Shortest Path (Giant Component)'] = 'Giant component not connected'
        # Components and their sizes
        component_sizes = [len(c) for c in sorted(nx.connected_components(graph), key=len, reverse=True)]
        unique_sizes = {size: component_sizes.count(size) for size in set(component_sizes)}
        components_str = ", ".join(f"{size}:{count}" for size, count in sorted(unique_sizes.items(), key=lambda x: x[0], reverse=True))
        metrics['Number of Components'] = len(component_sizes)
        metrics['Components by Size'] = components_str
        return metrics
    # Calculate metrics for the original graph
    original_metrics = calculate_metrics(G)
    # If comparison with a random graph is requested
    random_metrics = {}
    if compare_random:
        nodes = G.number_of_nodes()
        p = original_metrics['Edge Density']
        random_graph = nx.gnp_random_graph(n=nodes, p=p)
        random_metrics = calculate_metrics(random_graph)
    # Combine the metrics into a DataFrame
    metrics_data = {
        'Metric': [],
        'Original Graph': [],
        'Random Graph': []
    }
    # Append metrics to the data structure
    for metric, value in original_metrics.items():
        metrics_data['Metric'].append(metric)
        metrics_data['Original Graph'].append(value)
        metrics_data['Random Graph'].append(random_metrics.get(metric, 'Not Compared'))
    metrics_df = pd.DataFrame(metrics_data)
    # Create the node DataFrame if requested
    node_df = None
    if return_node_df:
        nodes_data = {
            'Node': list(G.nodes()),
            'Degree': list(dict(G.degree()).values()),
            'Clustering Coefficient': list(nx.clustering(G).values()),
'Betweenness Centrality': list(nx.betweenness_centrality(G).values()),
'Closeness Centrality': list(nx.closeness_centrality(G).values()),
'Eigenvector Centrality': list(nx.eigenvector_centrality_numpy(G, max_iter=max_iter).values())
}
if nx.is_connected(G):
# Compute shortest path length for all nodes
sp_length_dict = dict(nx.all_pairs_shortest_path_length(G))
mean_sp_length = {node: np.mean(list(lengths.values())) for node, lengths in sp_length_dict.items()}
nodes_data['Mean Shortest Path'] = list(mean_sp_length.values())
else:
# If the graph is not connected, we assign NaN for mean shortest path length
nodes_data['Mean Shortest Path'] = [np.nan] * len(nodes_data['Node'])
node_df = pd.DataFrame(nodes_data)
print(f'# Analyzed graph in {time.time() - start_time} seconds')
return metrics_df, node_df
# Example usage:
#G = nx.gnp_random_graph(10, 0.5)
# Now let's call the function with the compare_random flag set to True
metrics_df, node_df = analyze_graph(G, compare_random=True, return_node_df=True)
#print(metrics_df.to_string(index=False))
# Starting graph analysis
# Analyzed graph in 21.72000217437744 seconds
Printing and understanding the report#
The report is a dataframe.
For items that show a list of 5 numbers, those are the mean, median, standard deviation, minimum, and maximum of that metric across all nodes (the statistics are calculated per node and then summarized).
The ‘Components by Size’ entry indicates that the original graph has 99 components: 1 of size 1278, 1 of size 7, 1 of size 5, 4 of size 4, 4 of size 3, 11 of size 2, and 77 of size 1.
Let’s look at the table, and then we’ll dig into what things mean down below.
The node_df dataframe has word-specific data in it that we may want to use later…
print(len(node_df))
print(node_df.head())
# If we just specify the dataframe we get a very pretty table from jupyter (rather than using a print statement)
metrics_df
1417
Node Degree Clustering Coefficient Betweenness Centrality \
0 abet 0 0.0 0.0
1 able 1 0.0 0.0
2 ache 1 0.0 0.0
3 acid 3 1.0 0.0
4 acre 1 0.0 0.0
Closeness Centrality Eigenvector Centrality Mean Shortest Path
0 0.000000 -6.826895e-19 NaN
1 0.000706 -4.188225e-18 NaN
2 0.000706 2.295249e-18 NaN
3 0.002119 -5.333634e-18 NaN
4 0.000706 -1.038642e-18 NaN
| | Metric | Original Graph | Random Graph |
|---|---|---|---|
| 0 | Nodes | 1417 | 1417 |
| 1 | Edges | 4460 | 4481 |
| 2 | Isolated Nodes | 77 | 0 |
| 3 | Edge Density | 0.004446 | 0.004467 |
| 4 | Transitivity | 0.385021 | 0.002773 |
| 5 | Assortativity | 0.535244 | 0.030866 |
| 6 | Degree | [6.295, 6.0, 4.1488, 0, 20] | [6.3246, 6.0, 2.4533, 1, 15] |
| 7 | Clustering Coefficient | [0.3443, 0.3273, 0.2734, 0.0, 1.0] | [0.0024, 0.0, 0.0128, 0.0, 0.1667] |
| 8 | Shortest Path | Graph not connected | [4.1427, 4.0, 0.842, 0, 8] |
| 9 | Diameter of Giant Component | Undefined | 8 |
| 10 | Betweenness Centrality | [0.0028, 0.0015, 0.0038, 0.0, 0.0403] | [0.0022, 0.0018, 0.0017, 0.0, 0.0103] |
| 11 | Closeness Centrality | [0.1424, 0.1582, 0.0507, 0.0, 0.1985] | [0.2422, 0.2429, 0.0151, 0.1808, 0.2866] |
| 12 | Eigenvector Centrality | [0.0104, 0.001, 0.0245, -0.0, 0.1771] | [0.0235, 0.0215, 0.0123, 0.0012, 0.0882] |
| 13 | Modularity (Full Graph) | Undefined | 0.372249 |
| 14 | Modularity (Giant Component) | 0.713038 | 0.372249 |
| 15 | Shortest Path (Giant Component) | [5.8237, 6.0, 1.8901, 0, 15] | [4.1427, 4.0, 0.842, 0, 8] |
| 16 | Number of Components | 99 | 1 |
| 17 | Components by Size | 1278:1, 7:1, 5:1, 4:4, 3:4, 2:11, 1:77 | 1417:1 |
Breaking down the graph analysis#
This report compares the structural properties of an original graph with a random graph, which was generated to have a similar number of nodes and edges. Each metric is analyzed to understand the nature of each graph and to infer potential differences in their underlying structures.
Nodes
What it assesses: The total number of nodes (vertices) in the graph.
Analysis: Both graphs have the same number of nodes, 1417, which indicates that the comparison is on an equal footing in terms of graph size.
Edges
What it assesses: The total number of edges (links) between nodes in the graph.
Analysis: The original graph has slightly fewer edges than the random graph. Because the random graph was generated to match the original’s edge density, a gap this small is expected sampling noise and is unlikely to matter.
Isolated Nodes
What it assesses: The number of nodes that have no connections to other nodes.
Analysis: The original graph has a substantially higher number of isolated nodes, indicating a considerable number of nodes that do not connect with the main network. In a random graph, such a high number of isolates is less likely because edges are distributed randomly.
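To see how unlikely 77 isolates would be in the random graph, we can compute the expected number directly: in a G(n, p) random graph, each node is isolated with probability \( (1-p)^{n-1} \). A quick sketch using the node count and edge density from the table above (the helper name is our own):

```python
def expected_isolates(n, p):
    # Each of the n nodes is isolated with probability (1 - p)^(n - 1)
    return n * (1 - p) ** (n - 1)

# With n = 1417 and p ~ 0.004446 (the original graph's edge density),
# we expect only about 2-3 isolates in the random graph, consistent with the 0 observed
print(expected_isolates(1417, 0.004446))
```

The 77 isolates in the original graph are far above this expectation, which is what makes the metric informative here.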
Edge Density
What it assesses: The ratio of the number of edges to the number of possible edges, giving an indication of how interconnected the graph is.
Analysis: Edge densities are almost identical, which means that despite the original graph having more isolated nodes, the overall level of interconnectedness is comparable to the random graph.
Transitivity
What it assesses: The overall likelihood that connected triples of nodes close into triangles. This is like a clustering coefficient for the graph as a whole. High transitivity means that if node A is connected to both nodes B and C, there is a high likelihood that B and C are also connected; this pattern is often indicative of a community or a clique within the network. It is assessed by calculating the proportion of connected triples that are closed by a third edge (that is, we only consider cases of 3 nodes where at least two of the three possible edges exist, and then check whether the third edge exists as well). High transitivity implies robustness in a system that passes messages. In our case, where we are looking at word similarity, high transitivity might suggest that there are many linked clusters of similar items (e.g., zest, rest, nest, test, fest, best, beat, feat, seat, neat, next, text, tent, dent, bent…).
Analysis: The original graph has a much higher transitivity, indicating that it has a higher tendency for nodes to form tightly knit groups. This is a characteristic of networks with community structures or social networks. The random graph’s low transitivity implies that connections are distributed more evenly and less cluster-like, which is typical for random networks.
Note: It is a bit more complex than just the ratio of triangles to triples. Specifically, it is 3 times that ratio. Why? Imagine three friends: Alice, Bob, and Charlie. If Alice knows Bob and Alice knows Charlie, we have a “triple”: two friendships centered on Alice, with a potential third friendship directly between Bob and Charlie. If Bob and Charlie also become friends, we have a “triangle”: all three friends know each other. Here’s the key part: a completed triangle contains three triples, one centered on each person. (1) Centered on Alice: Alice knows Bob and Charlie, who know each other. (2) Centered on Bob: Bob knows Alice and Charlie, who know each other. (3) Centered on Charlie: Charlie knows Alice and Bob, who know each other. So when Bob and Charlie become friends, all three of these triples are closed at the same time. That’s why we multiply the number of triangles by 3 when we calculate transitivity: it puts closed triples on the same footing as the total count of triples, so the ratio tells us what fraction of all triples actually closed into triangles.
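To make the bookkeeping concrete, here is a tiny sketch (a toy graph of our own) confirming that networkx’s transitivity equals 3 × triangles / triples:

```python
import networkx as nx
from itertools import combinations

# Toy graph: one triangle (A, B, C) plus a dangling node D attached to A
G_demo = nx.Graph([('A', 'B'), ('A', 'C'), ('B', 'C'), ('A', 'D')])

# nx.triangles counts each triangle once per member node, so divide the sum by 3
triangles = sum(nx.triangles(G_demo).values()) // 3
# A triple is a pair of neighbors of some center node
triples = sum(len(list(combinations(list(G_demo.neighbors(n)), 2))) for n in G_demo)

print(triangles, triples)       # 1 triangle, 5 triples
print(3 * triangles / triples)  # 0.6
print(nx.transitivity(G_demo))  # 0.6
```

The one triangle closes three of the five triples (the ones centered on A, B, and C), giving transitivity 3/5.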
Assortativity: Assortativity in the context of network graphs refers to the tendency of nodes to connect with other nodes that are similar in some way. This similarity can be based on a node attribute, such as age, income level, or, in the case of a social network, the number of connections (degree). Specifically, degree assortativity measures the similarity of connections in terms of node degrees.
What it assesses: Degree assortativity in a network graph assesses the propensity of nodes to connect with other nodes that have a similar degree. If high-degree nodes tend to connect with other high-degree nodes, and low-degree nodes with other low-degree nodes, the network is said to exhibit positive assortativity or assortative mixing. Conversely, if high-degree nodes tend to connect with low-degree nodes, the network shows negative assortativity, or disassortative mixing.
Why it matters: Assortativity has implications for the robustness and structure of a network. Networks with positive assortativity are often more resilient to random failures, as high-degree nodes form a supportive core. In contrast, networks with negative assortativity may be more vulnerable but can facilitate faster spread or diffusion processes, like the spread of information or diseases.
Analysis: A network with high positive assortativity indicates a structured environment where similar nodes (in terms of the number of connections) preferentially connect to each other. This can be indicative of a social structure with pronounced community behavior or stratification. In contrast, a network with negative assortativity might suggest a more hierarchical structure, where central hubs are connected to peripheral nodes, common in biological and technological networks. For a word similarity network, positive assortativity could imply that words with many similar neighbors tend to connect to each other, while negative assortativity might indicate that sparsely connected words (low degree) are often linked to highly connected words (high degree), possibly due to etymological or linguistic patterns.
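Two toy graphs (our own, purely for illustration) make the extremes concrete: a star is maximally disassortative, because its high-degree hub touches only degree-1 leaves, while a graph whose edges all join equal-degree nodes is maximally assortative:

```python
import networkx as nx

# Disassortative extreme: every edge joins the hub (degree 5) to a leaf (degree 1)
star = nx.star_graph(5)
print(nx.degree_assortativity_coefficient(star))    # -1 (up to floating point)

# Assortative extreme: a triangle plus a separate edge; every edge joins equal-degree nodes
assort = nx.Graph([(0, 1), (1, 2), (2, 0), (3, 4)])
print(nx.degree_assortativity_coefficient(assort))  # 1 (up to floating point)
```

Real networks fall somewhere between these poles; our original graph’s 0.54 is strongly on the assortative side.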
Degree
What it assesses: The average number of connections each node has.
Analysis: Both graphs have similar average and median degrees, but the original graph has a higher standard deviation, which could indicate a presence of hub nodes or a more varied distribution of connections per node, which could indicate small world properties (we will assess this formally later). In the random graph, connections are more uniformly distributed.
Clustering Coefficient
What it assesses: The degree to which nodes in the graph tend to cluster together. Literally, the proportion of a node’s neighbors that are neighbors of each other.
Analysis: The original graph exhibits a much higher clustering coefficient, suggesting that nodes tend to form tight-knit communities (possibly a clue to small world structure). The random graph’s low value indicates an absence of such clustering, which is expected in random networks.
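As a concrete check of that definition (with a toy graph of our own): node 0 below has three neighbors, and exactly one of the three possible edges among them exists, so its local clustering coefficient should be 1/3.

```python
import networkx as nx

# Node 0's neighbors are 1, 2, 3; only the edge (1, 2) exists among them
G_demo = nx.Graph([(0, 1), (0, 2), (0, 3), (1, 2)])
print(nx.clustering(G_demo, 0))  # 1 existing edge / 3 possible = 0.333...
```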
Betweenness Centrality
What it assesses: The extent to which a node lies on paths between other nodes, indicating its importance in the flow of information or connectivity. Here, though, we are not looking at information flow, but just structure based on similarity, so nodes with high betweenness would likely be important hubs that link word clusters.
Analysis: The original graph’s higher maximum betweenness centrality suggests that some nodes act as significant points of connection between other nodes. The random graph has a more uniform distribution, which is expected if connections are random.
Closeness Centrality
What it assesses: How close a node is to all other nodes in the graph, indicating how quickly information can spread from that node to others – or in our case, how close various nodes are in our similarity space.
Analysis: The closeness centrality is significantly higher in the random graph, suggesting that nodes in the random graph are, on average, closer to all other nodes. This reflects a more uniform network structure compared to the original graph.
Eigenvector Centrality
What it assesses: The influence of a node in a network, considering the influence of its neighbors. Do nodes tend to have similar influence as their neighbors?
Analysis: The original graph has a greater range in eigenvector centrality, indicating a variation in node influence based on the network’s structure (possibly indicating hubbiness). The random graph has more uniformly influential nodes.
Modularity (Giant Component)
What it assesses: The strength of the division of the graph into modules (or communities).
Analysis: The original graph’s giant component has high modularity, indicating strong community structures. The random graph’s lower modularity suggests that its structure is less clearly divided into communities.
Shortest Path (Giant Component)
What it assesses: The average number of steps along the shortest paths for all possible pairs of network nodes within the giant component.
Analysis: The original graph’s giant component has longer shortest paths on average, implying a more extended structure or less dense connectivity. The random graph’s giant component has shorter paths, which could be due to its more uniform connections. The difference is small, though.
Number of Components
What it assesses: The number of connected subgraphs in the network.
Analysis: The original graph has many more components, suggesting a fragmentation of the network into many small or isolated groups. The random graph has fewer components, which is typical for a random network with evenly distributed edges.
Components by Size
What it assesses: The distribution of component sizes in the network.
Analysis: The original graph has a wide variety of component sizes, including many single-node components, indicating a high degree of fragmentation. The random graph mostly consists of one large component and a few single-node components, reinforcing the uniformity of a random network.
Looking for motifs#
Motifs are connectivity patterns that recur throughout a graph. I don’t know of methods to ‘mine’ graphs to discover motifs, though they may exist. Instead, we have to look for interesting patterns and then find a way to search for and count them.
A simple motif to look for would be triangles: when we have node A connected to B and C, we can test whether B connects to C, forming a triangle. Triangles promote higher clustering, betweenness centrality, and so on, and relate to assortativity as noted above. Here’s a simple check. We should get a value that is 1/3 of the Transitivity from the table above.
import networkx as nx
from itertools import combinations

def find_triangles_and_ratio(G):
    triples_count = 0
    triangles_count = 0
    triangles_seen = set()
    for node in G.nodes():
        neighbors = list(G.neighbors(node))
        # Each pair of neighbors forms a triple centered on the current node
        triples_count += len(list(combinations(neighbors, 2)))
        for n1, n2 in combinations(neighbors, 2):
            if G.has_edge(n1, n2):
                # Use a sorted tuple so each triangle is only counted once
                triangle = tuple(sorted([node, n1, n2]))
                if triangle not in triangles_seen:
                    triangles_seen.add(triangle)
                    triangles_count += 1
    # Each triangle closes one triple centered on each of its three nodes,
    # so this ratio should come out to transitivity / 3
    ratio = triangles_count / triples_count if triples_count > 0 else 0
    return triples_count, triangles_count, ratio

# Example usage:
# G = nx.erdos_renyi_graph(n=10, p=0.5)
triples, triangles, ratio = find_triangles_and_ratio(G)
print(f"Number of triples: {triples}")
print(f"Number of triangles: {triangles}")
print(f"Ratio of triangles to triples: {ratio}")
Number of triples: 35811
Number of triangles: 4596
Ratio of triangles to triples: 0.12834045405043143
Visualizing graph properties#
Let’s plot distributions of key metrics in generic matplotlib.
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np

# Handle the (source, lengths) pairs yielded by nx.shortest_path_length, component by component
def calculate_shortest_path_lengths(G):
    lengths = []
    for component in nx.connected_components(G):
        subgraph = G.subgraph(component)
        if len(subgraph) > 1:  # Only calculate for components with more than 1 node
            for source, length_dict in nx.shortest_path_length(subgraph):
                lengths.extend(length_dict.values())
    return lengths

def plot_distributions(G):
    # Prepare the layout for the plots
    fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(15, 10))
    plt.subplots_adjust(hspace=0.4, wspace=0.4)

    # Degree distribution (bins start at 0 so isolated nodes are included)
    degrees = [G.degree(n) for n in G.nodes()]
    axes[0, 0].hist(degrees, bins=range(0, max(degrees) + 2), color='skyblue', edgecolor='black')
    axes[0, 0].set_title('Degree Distribution')
    axes[0, 0].set_xlabel('Degree')
    axes[0, 0].set_ylabel('Frequency')

    # Clustering coefficient distribution
    clustering_coeffs = list(nx.clustering(G).values())
    axes[0, 1].hist(clustering_coeffs, bins=np.linspace(0, 1, 21), color='salmon', edgecolor='black')
    axes[0, 1].set_title('Clustering Coefficient Distribution')
    axes[0, 1].set_xlabel('Clustering Coefficient')
    axes[0, 1].set_ylabel('Frequency')

    # Shortest path length distribution
    sp_lengths = calculate_shortest_path_lengths(G)
    if sp_lengths:  # Only plot if we have lengths (i.e., the graph is not fully disconnected)
        axes[0, 2].hist(sp_lengths, bins=range(1, max(sp_lengths) + 2), color='lightgreen', edgecolor='black')
        axes[0, 2].set_title('Shortest Path Length Distribution')
        axes[0, 2].set_xlabel('Shortest Path Length')
        axes[0, 2].set_ylabel('Frequency')

    # Betweenness centrality distribution
    betweenness = list(nx.betweenness_centrality(G).values())
    axes[1, 0].hist(betweenness, bins=np.linspace(0, max(betweenness), 21), color='lightblue', edgecolor='black')
    axes[1, 0].set_title('Betweenness Centrality Distribution')
    axes[1, 0].set_xlabel('Betweenness Centrality')
    axes[1, 0].set_ylabel('Frequency')

    # Closeness centrality distribution
    closeness = list(nx.closeness_centrality(G).values())
    axes[1, 1].hist(closeness, bins=np.linspace(0, 1, 21), color='violet', edgecolor='black')
    axes[1, 1].set_title('Closeness Centrality Distribution')
    axes[1, 1].set_xlabel('Closeness Centrality')
    axes[1, 1].set_ylabel('Frequency')

    # Eigenvector centrality distribution
    eigenvector = list(nx.eigenvector_centrality_numpy(G).values())
    axes[1, 2].hist(eigenvector, bins=np.linspace(0, 1, 21), color='orange', edgecolor='black')
    axes[1, 2].set_title('Eigenvector Centrality Distribution')
    axes[1, 2].set_xlabel('Eigenvector Centrality')
    axes[1, 2].set_ylabel('Frequency')

    plt.show()

# Create a random graph and plot the distributions
# G = nx.gnp_random_graph(100, 0.05)
plot_distributions(G)
Assessing power-law distributions and small worldness#
Power Law Distribution#
When degree has a power law distribution, this is a strong indicator that the network may be scale free.
A power law distribution is a statistical distribution with the property that one quantity varies as a power of another. Mathematically, it can be represented as: \( P(x) \propto x^{-\alpha} \), where \( P(x) \) is the probability of the variable \( x \), and \( \alpha \) is a constant exponent that is typically greater than 1.
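For intuition, a continuous power law is easy to simulate by inverse-transform sampling: with minimum value \( x_{min} \), draw \( u \sim U(0,1) \) and set \( x = x_{min}(1-u)^{-1/(\alpha-1)} \). A quick sketch (the function name is our own):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_power_law(n, alpha=2.5, xmin=1.0):
    # Invert the CDF F(x) = 1 - (x / xmin)^(-(alpha - 1))
    u = rng.random(n)
    return xmin * (1 - u) ** (-1 / (alpha - 1))

samples = sample_power_law(100_000)
print(samples.min())  # never below xmin
print(samples.max())  # the heavy tail occasionally produces very large values
```

Plotting a histogram of these samples on log-log axes produces the characteristic straight line discussed below.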
Significance and Implications#
The presence of a power law relationship in a dataset indicates:
Scale Invariance: The phenomenon has no characteristic scale, and similar patterns recur at various scales.
Heavy Tails: The distribution does not taper off as quickly as a Gaussian, implying a significant number of observations or events far from the mean.
Heterogeneity: There is a large variance in the properties of the elements of the system, with a few large events or observations dominating the distribution.
Complex Systems: Power laws often emerge in complex, self-organizing systems and are indicative of underlying processes that do not depend on a characteristic scale.
Robustness and Fragility: Systems with a scale-free topology, such as many real-world networks, are robust against random failures but fragile to targeted attacks on highly connected nodes.
Predictability Challenges: Due to heavy tails, extreme events are more common than in normal distributions, posing challenges for prediction and risk assessment.
Understanding power law distributions is crucial for analyzing systems with large-scale heterogeneities and for preparing for extreme events in various fields, from finance to natural disasters.
Examples of Real-World Power Law Distributions#
Several natural phenomena exhibit power law distributions, indicating scale-invariance and the presence of complex underlying processes. Here are some examples:
Earthquakes: The frequency of earthquakes follows the Gutenberg-Richter law, with the number of events exceeding a certain size diminishing exponentially with the magnitude.
River Networks: The distribution of river lengths and drainage areas, displaying fractal-like behavior that manifests in power law distributions.
City Sizes: Populations of cities follow Zipf’s law, where the rank of a city inversely correlates with its population size.
Solar Flares: The intensity and occurrence of solar flares adhere to a power law, with fewer large-scale flares but with significantly greater energy.
Biological Extinctions: The scale of mass extinctions follows a power law, with small extinctions being common and large ones rare but significant.
Distribution of Wealth: Wealth among individuals often follows Pareto’s distribution, with a small portion of the population holding a large fraction of total wealth.
Word Frequency: The frequency of word use in natural languages exhibits a power law distribution known as Zipf’s law, where many words have low frequency of occurrence and few have very high frequency of occurrence.
Biological Taxonomy: The hierarchical organization of taxonomic categories, like species per genus, shows power law distribution patterns.
Protein Interaction Networks: The number of interactions per protein in biological networks, with many proteins having few interactions and a few having many.
Meteorite Sizes: The sizes of meteorites that hit Earth, with numerous small meteorites and a few very large ones.
Assessing Power Law Distributions#
Determining whether a distribution follows a power law involves several steps, each critical to establishing the validity of the power law model for a given dataset. The simplest way of assessing whether a graph is scale-free involves assessing the degree distribution in the following ways (though other characteristics can be important, such as a high clustering coefficient or hierarchical/fractal structure). We won’t actually do any of these for this class, but this is meant to give you some ideas of what would be needed for a real analysis.
Plotting the Data#
Log-Log Plot: Create a log-log plot of the data. A linear relationship in this space suggests power law behavior.
Statistical Tests#
Maximum Likelihood Estimation (MLE): Use MLE to estimate the power law exponent \( \alpha \). It provides a more accurate estimation than linear regression on log-log plots.
Goodness-of-Fit Tests: Conduct tests, such as the Kolmogorov-Smirnov test, to compare the observed data with the expected power law distribution. A significant deviation may indicate a poor fit.
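A minimal sketch of that MLE for the continuous case, \( \hat{\alpha} = 1 + n / \sum_i \ln(x_i/x_{min}) \) (the estimator from Clauset, Shalizi & Newman; `estimate_alpha` is our own helper, not a library function):

```python
import numpy as np

def estimate_alpha(values, xmin=1.0):
    # Continuous power-law MLE: alpha_hat = 1 + n / sum(ln(x_i / xmin)), for x_i >= xmin
    x = np.asarray([v for v in values if v >= xmin], dtype=float)
    return 1 + len(x) / np.log(x / xmin).sum()

# Check against synthetic samples drawn with a known exponent of 2.5
rng = np.random.default_rng(42)
u = rng.random(50_000)
samples = (1 - u) ** (-1 / 1.5)  # inverse-transform draws with alpha = 2.5, xmin = 1
print(estimate_alpha(samples))   # should land close to 2.5
```

Unlike a regression on log-binned counts, this estimator is unbiased in the large-sample limit and does not depend on binning choices.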
Comparative Analysis#
Comparative Analysis: Compare the power law model with alternative distributions like exponential, log-normal, or stretched exponential using likelihood ratios and information criteria (e.g., AIC or BIC).
Range of Fit#
Range of Fit: Determine the range over which the power law applies; often power laws are only apparent above a certain threshold \( x_{min} \).
Underlying Mechanisms#
Underlying Mechanisms: Consider theoretical models or processes that could explain the emergence of a power law, such as preferential attachment or self-organized criticality.
Robustness Checks#
Robustness Checks: Evaluate the sensitivity of the power law parameters and the fitting range to ensure the robustness of the power law model.
In practice, confirming a power law distribution requires more than identifying a straight line on a log-log plot. It necessitates a careful and methodical approach, combining statistical methods, comparative model analysis, and theoretical reasoning. But here, we’ll just use our log-log plots heuristically. Later, we’ll create a somewhat simple test for power law distribution.
Assessing Small-World Properties of a Graph#
Small-world network structures are quite prevalent in both natural and human-made systems. Some examples of real-world complex systems that exhibit small-world properties include:
Examples of Real-World Systems with Small World Characteristics#
Social Networks: People are typically a few acquaintances away from each other (popularized by the concept of “six degrees of separation”). Social networks like Facebook and LinkedIn show small-world characteristics with clusters of friends or professional contacts and relatively short paths across the global network.
Neural Networks: Both the human brain and nervous systems of other animals display qualities of small-world networks in some circumstances. Neurons tend to form local clusters with strong interconnections, and there are relatively short paths between any two neurons in the brain.
The Power Grid: The electricity supply network has a small-world structure. There are high-density clusters of power lines in populated areas, and longer lines connect these clusters over significant distances.
The Internet: The connectivity of routers and autonomous systems on the Internet forms a small-world network, with dense local clusters and a few long-range connections providing short paths across the web.
Language Networks: Words in a language can be considered a network where two words are connected if they appear adjacently in sentences. Such networks exhibit the small-world property.
Scientific Collaboration Networks: In networks of co-authorship, where nodes represent scientists and edges represent collaborative papers, there are often tight clusters within fields and shorter paths across disciplines.
Protein Interaction Networks: In biology, proteins interact with each other in complex ways, often showing small-world network structures, which can be crucial for the understanding of biological processes.
Economic Networks: Trade networks linking businesses or countries have a small-world structure, with clusters in sectors or regions and some long-range connections that significantly shorten the path between any two nodes in the network.
Transportation Networks: Public transportation systems, airline networks, and shipping routes often show small-world characteristics, with dense connections within cities or regions and a few longer connections between them.
Ecological Networks: In ecological systems, species are interconnected through various relationships like predation or competition, often showing small-world properties.
These small-world networks are characterized by high clustering typical of regular lattices and the short average path lengths typical of random graphs. This combination facilitates efficient information transfer/interaction, resilience to random failures, and rapid synchronization, which are advantageous traits for many complex systems.
To determine whether a graph has small-world properties, we need to compare it to an equivalent random graph in terms of the average path length and clustering coefficient.
Calculate the Average Path Length (L): This is the average number of steps along the shortest paths for all possible pairs of network nodes. A small-world network will have a small average path length similar to a random graph of the same size.
Calculate the Clustering Coefficient (C): This measures the degree to which nodes in a graph tend to cluster together. For a small-world network, the clustering coefficient will be significantly higher than that of a random graph.
Generate an Equivalent Random Graph: Create a random graph with the same number of nodes and edges as the original graph. This graph serves as a benchmark for comparison.
Calculate L and C for the Random Graph: Determine the average path length \( L_{random} \) and clustering coefficient \( C_{random} \) for the random graph.
Compare L and C: A small-world network is characterized by a much higher clustering coefficient \( C \) than \( C_{random} \) while maintaining an average path length \( L \) comparable to \( L_{random} \).
Calculate the Small-World Coefficient (sigma): The small-world-ness of a network can be quantified as \( \sigma = \frac{C / C_{random}}{L / L_{random}} \). If \( \sigma > 1 \), the network is considered to have small-world properties.
Consider the Degree Distribution: While not a defining characteristic, many small-world networks also exhibit a degree distribution that is not purely random, often with a higher occurrence of hubs (but not necessarily a power-law distribution).
Assess the Presence of Community Structures: Small-world networks often contain communities or modules, which are groups of nodes that are more densely connected to each other than to the rest of the network.
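The steps above can be sketched directly in networkx. This is a rough version under stated assumptions: we compare against a single G(n, p) graph matched on density and compute path lengths on giant components so they are always defined; `small_world_sigma` is our own helper. (networkx also provides `nx.sigma`, which uses degree-preserving edge rewiring and is more principled but slow on large graphs.)

```python
import networkx as nx

def small_world_sigma(G, seed=0):
    n = G.number_of_nodes()
    # Equivalent random graph with the same number of nodes and the same edge density
    R = nx.gnp_random_graph(n, nx.density(G), seed=seed)
    # Work on giant components so average path lengths are defined
    Gc = G.subgraph(max(nx.connected_components(G), key=len))
    Rc = R.subgraph(max(nx.connected_components(R), key=len))
    C, C_rand = nx.average_clustering(G), nx.average_clustering(R)
    L, L_rand = (nx.average_shortest_path_length(Gc),
                 nx.average_shortest_path_length(Rc))
    return (C / C_rand) / (L / L_rand)

# A Watts-Strogatz graph should come out clearly small-world (sigma > 1)
ws = nx.watts_strogatz_graph(500, 10, 0.1, seed=1)
print(small_world_sigma(ws))
```

Applied to our word graph, the table above already contains the ingredients: clustering far above the random graph’s, with an average shortest path only modestly longer, so a sigma well above 1 is expected.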
Visualizing properties#
Let’s just use log-log plots for now.
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
from scipy.stats import linregress

def calculate_shortest_path_lengths(G):
    lengths = []
    for component in nx.connected_components(G):
        subgraph = G.subgraph(component)
        if len(subgraph) > 1:  # Only calculate for components with more than 1 node
            for source, length_dict in nx.shortest_path_length(subgraph):
                lengths.extend(length_dict.values())
    return lengths

def plot_loglog_distribution(data, ax, metric_name):
    # Calculate histogram values
    counts, bin_edges = np.histogram(data, bins=50)
    bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
    # Filter out zeros to avoid log(0)
    nonzero_indices = counts > 0
    log_bin_centers = np.log10(bin_centers[nonzero_indices])
    log_counts = np.log10(counts[nonzero_indices])
    # Scatter plot of the log-log values
    ax.scatter(log_bin_centers, log_counts, color='r', label='Data points')
    # Fit a line to the log-log data points
    slope, intercept, _, _, _ = linregress(log_bin_centers, log_counts)
    best_fit_line = slope * log_bin_centers + intercept
    # Plot the best fit line
    ax.plot(log_bin_centers, best_fit_line, 'b-', label=f'Best fit line (slope: {slope:.2f})')
    # Set the plot labels and title
    ax.set_xlabel(f'log10({metric_name})')
    ax.set_ylabel('log10(Count)')
    ax.set_title(f'Log-Log Plot of {metric_name} Distribution')
    ax.legend()

def plot_distributions_loglog(G):
    # Prepare the layout for the plots
    fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(15, 10))
    plt.subplots_adjust(hspace=0.4, wspace=0.4)
    # Plot each metric distribution on a log-log scale
    metrics = {
        'Degree': [G.degree(n) for n in G.nodes()],
        'Clustering Coefficient': list(nx.clustering(G).values()),
        'Shortest Path Length': calculate_shortest_path_lengths(G),
        'Betweenness Centrality': list(nx.betweenness_centrality(G).values()),
        'Closeness Centrality': list(nx.closeness_centrality(G).values()),
        'Eigenvector Centrality': list(nx.eigenvector_centrality_numpy(G).values()),
    }
    for ax, (metric_name, data) in zip(axes.flatten(), metrics.items()):
        if data:  # Only plot if the data is not empty
            plot_loglog_distribution(data, ax, metric_name)
    # Adjust layout and display the plot
    plt.tight_layout()
    plt.show()

# Assuming G is a previously created NetworkX graph
plot_distributions_loglog(G)
Interesting! We do not see anything resembling a power-law distribution (straight line in log-log space) for degree. However, we do see power-law-like patterns in the distributions of betweenness and eigenvector centrality.
A power-law distribution of betweenness centrality indicates that there are a few nodes in the network that act as critical bridges or bottlenecks for the flow of information or resources. These nodes may not necessarily have the highest number of direct connections (as would be indicated by degree centrality), but they are positioned within the network such that many shortest paths pass through them. This would be great for passing messages; what do we think it could imply about a similarity space? Words that act as “hubs” in terms of betweenness centrality are those that can connect many other words through single-letter substitutions. These words are likely to be very “central” in the sense that they can be transformed into a large number of other words with minimal changes. Such words might not be the most common or have the most direct single-letter neighbors, but they play a crucial role in linking different clusters or parts of the network. They can bridge different groups of words, suggesting that they are versatile in the context of word transformations.
A power-law distribution of eigenvector centrality suggests that there is a hierarchy of influence among the nodes. Nodes with high eigenvector centrality are connected to other nodes that are themselves influential. This means that the network may have influential clusters that are not simply defined by the number of connections but by the influence of their connections. A power-law distribution in eigenvector centrality indicates that there are key words that are not only well-connected themselves but are also neighbors with other well-connected words. This could imply a clustering of similar words where the clusters are not merely groups of words with many neighbors, but groups where the words are mutually similar to other highly connected words. In terms of similarity, this might suggest that certain “core” words or root forms have variations that are also quite common and interconnected, leading to clusters of words with similar spellings and potentially similar meanings (if they are morphologically related, like ‘ring’, ‘rung’, and ‘rang’).
Let’s implement a formal test for small worldness.
Formal test for small worldness#
This is not necessarily a perfect approach, but it’s pretty good. The function will print out some key stats and then rate the graph as small world or not.
import time

import networkx as nx
import numpy as np

def is_small_world(G, n_random_graphs=10):
    """
    Assess whether a graph G is likely to be a small-world network.

    Parameters:
    - G: a NetworkX graph
    - n_random_graphs: number of random graphs to generate for comparison

    Returns:
    - result: dictionary containing metrics and assessment
    """
    start_time = time.time()
    print('# Starting small world evaluation; this will take some time...')
    # Calculate metrics for the original graph
    original_clustering_coeff = nx.average_clustering(G)
    if nx.is_connected(G):
        original_path_length = nx.average_shortest_path_length(G)
    else:
        # Consider the largest connected component
        largest_cc = max(nx.connected_components(G), key=len)
        original_path_length = nx.average_shortest_path_length(G.subgraph(largest_cc))
    # Metrics for equivalent random graphs
    random_clustering_coeffs = []
    random_path_lengths = []
    for _ in range(n_random_graphs):
        # Generate a random graph with the same number of nodes and edges
        random_graph = nx.gnm_random_graph(G.number_of_nodes(), G.number_of_edges())
        random_clustering_coeffs.append(nx.average_clustering(random_graph))
        if nx.is_connected(random_graph):
            random_path_lengths.append(nx.average_shortest_path_length(random_graph))
        else:
            # Consider the largest connected component
            largest_cc_random = max(nx.connected_components(random_graph), key=len)
            random_path_lengths.append(nx.average_shortest_path_length(random_graph.subgraph(largest_cc_random)))
    # Average over the random graph metrics
    avg_random_clustering_coeff = np.mean(random_clustering_coeffs)
    avg_random_path_length = np.mean(random_path_lengths)
    # Calculate the small-worldness metric
    if avg_random_clustering_coeff > 0 and avg_random_path_length > 0:
        small_worldness = (original_clustering_coeff / avg_random_clustering_coeff) / \
                          (original_path_length / avg_random_path_length)
    else:
        small_worldness = None
    # Determine if the graph is small-world
    is_small_world = small_worldness is not None and small_worldness > 1
    result = {
        'Original Clustering Coefficient': original_clustering_coeff,
        'Average Random Graph Clustering Coefficient': avg_random_clustering_coeff,
        'Original Average Path Length': original_path_length,
        'Average Random Graph Path Length': avg_random_path_length,
        'Small-Worldness': small_worldness,
        'Is Small-World': is_small_world
    }
    print(f'# Assessed small worldness in {time.time() - start_time} seconds')
    return result

# Example usage:
## G = nx.newman_watts_strogatz_graph(100, 5, 0.1)  # A graph that should exhibit small-world properties
result = is_small_world(G)
print(result)
# Starting small world evaluation; this will take some time...
# Assessed small worldness in 51.07087206840515 seconds
{'Original Clustering Coefficient': 0.3442593256015412, 'Average Random Graph Clustering Coefficient': 0.004106937741480478, 'Original Average Path Length': 5.828235925603215, 'Average Random Graph Path Length': 4.1513256980652375, 'Small-Worldness': 59.70590404642133, 'Is Small-World': True}
Okay, so our 4-letter word graph meets the criteria for a small-world network: its clustering coefficient is far above the random baseline while its average path length stays comparable to the random graph’s, and this can hold without the network necessarily being scale-free.
Let’s do tests on random networks with functions that should create a small-world network and then another that should not.
Random small world graph#
GG = nx.newman_watts_strogatz_graph(100, 5, 0.1) # A graph that should exhibit small-world properties
result = is_small_world(GG)
print(result)
# Starting small world evaluation; this will take some time...
# Assessed small worldness in 0.17535185813903809 seconds
{'Original Clustering Coefficient': 0.4224285714285714, 'Average Random Graph Clustering Coefficient': 0.040717155067155064, 'Original Average Path Length': 4.599191919191919, 'Average Random Graph Path Length': 3.2157373960518294, 'Small-Worldness': 7.253955751290159, 'Is Small-World': True}
Random but not small world graph#
# now let's do one that should not be small world
GG = nx.gnp_random_graph(100, 0.004)
result = is_small_world(GG)
print(result)
# Starting small world evaluation; this will take some time...
# Assessed small worldness in 0.004016876220703125 seconds
{'Original Clustering Coefficient': 0.0, 'Average Random Graph Clustering Coefficient': 0.0, 'Original Average Path Length': 1.3333333333333333, 'Average Random Graph Path Length': 1.8566666666666667, 'Small-Worldness': None, 'Is Small-World': False}
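The `Small-Worldness: None` above is expected: with edge probability 0.004 on 100 nodes there are essentially no triangles, so the random-baseline clustering is 0 and the \( \sigma \) ratio is undefined. More generally, the expected clustering coefficient of an Erdős–Rényi graph is just its edge probability \( p \), which we can sanity-check with a quick sketch:

```python
import networkx as nx
import numpy as np

# In G(n, p), any two neighbors of a node are themselves linked with
# probability p, so the expected clustering coefficient is simply p.
n, p = 300, 0.1
coeffs = [nx.average_clustering(nx.gnp_random_graph(n, p, seed=s)) for s in range(5)]
mean_c = float(np.mean(coeffs))
print(f"mean clustering = {mean_c:.3f} (expected ~ {p})")
```

At p = 0.004 that expected value is so close to zero that a 100-node sample usually contains no triangles at all, which is exactly what we saw.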
Formal test for scale-free graphs (power law distribution of degree)#
This is similar to the test above for small world status, but now we are assessing power-law distribution of degree.
import networkx as nx
import numpy as np
import powerlaw  # You may need to install this with pip

def is_scale_free(G, xmin=None):
    """
    Assess whether a graph G is likely to be a scale-free network.

    Parameters:
    - G: a NetworkX graph
    - xmin: minimum value of x to fit the power-law distribution.
      If None, the optimal value will be estimated.

    Returns:
    - result: dictionary containing metrics and assessment
    """
    degrees = [G.degree(n) for n in G.nodes()]
    # Fit the degree distribution to a power-law distribution
    fit = powerlaw.Fit(degrees, xmin=xmin, discrete=True)
    # Compare the power-law fit with an exponential distribution
    R, p = fit.distribution_compare('power_law', 'exponential', normalized_ratio=True)
    # The graph is considered scale-free if the power-law fit is significantly better than the exponential
    is_scale_free = R > 0 and p < 0.05
    result = {
        'Alpha': fit.power_law.alpha,
        'Sigma': fit.power_law.sigma,
        'Xmin': fit.power_law.xmin,
        'R': R,
        'p': p,
        'Is Scale-Free': is_scale_free
    }
    return result

# Example usage:
## G = nx.barabasi_albert_graph(100, 2)  # A graph that should exhibit scale-free properties
result = is_scale_free(G)
print(result)
Calculating best minimal value for power law fit
{'Alpha': 11.260922080235627, 'Sigma': 1.9054053956977333, 'Xmin': 16.0, 'R': -2.5441181370956603, 'p': 0.010955402522346992, 'Is Scale-Free': False}
Values less than or equal to 0 in data. Throwing out 0 or negative values
Now let’s try with a random network that should be scale free. In an earlier approach, it wasn’t always working, so I set it up to do multiple random networks. Then I found the scale_free_graph function, which seems to always work.
def test_scale_free(n, iterations=10):
    scale_free_count = 0
    for i in range(iterations):
        # Generate a scale-free graph
        BAG = nx.scale_free_graph(n)
        # Calculate the degree sequence and fit a power law
        degree_sequence = sorted([d for _, d in BAG.degree()], reverse=True)
        fit = powerlaw.Fit(degree_sequence, xmin=1)
        # Perform the goodness-of-fit test
        R, p = fit.distribution_compare('power_law', 'exponential')
        # Determine if the network is scale-free based on the p-value
        if p < 0.05 and R > 0:
            scale_free_count += 1
        # Print out the results for each iteration
        print(f"Iteration {i+1}: Alpha={fit.alpha:.4f}, Sigma={fit.sigma:.4f}, "
              f"Xmin={fit.xmin}, R={R:.4f}, p={p:.4f}, Is Scale-Free={'True' if p < 0.05 and R > 0 else 'False'}")
    return scale_free_count

# Parameters (n: number of nodes)
n = 1000  # number of nodes

# Run the test iter times
iter = 10
scale_free_networks = test_scale_free(n, iterations=iter)
print(f"{scale_free_networks} out of {iter} networks were classified as scale-free.")
Iteration 1: Alpha=2.6781, Sigma=0.0531, Xmin=1.0, R=1115.0347, p=0.0000, Is Scale-Free=True
Iteration 2: Alpha=2.7678, Sigma=0.0559, Xmin=1.0, R=1214.2585, p=0.0000, Is Scale-Free=True
Iteration 3: Alpha=2.6212, Sigma=0.0513, Xmin=1.0, R=1080.0611, p=0.0000, Is Scale-Free=True
Iteration 4: Alpha=2.7064, Sigma=0.0540, Xmin=1.0, R=1149.5306, p=0.0000, Is Scale-Free=True
Iteration 5: Alpha=2.6817, Sigma=0.0532, Xmin=1.0, R=1142.4686, p=0.0000, Is Scale-Free=True
Iteration 6: Alpha=2.6481, Sigma=0.0521, Xmin=1.0, R=1057.2522, p=0.0000, Is Scale-Free=True
Iteration 7: Alpha=2.5665, Sigma=0.0495, Xmin=1.0, R=965.5142, p=0.0000, Is Scale-Free=True
Iteration 8: Alpha=2.6586, Sigma=0.0524, Xmin=1.0, R=1099.3647, p=0.0000, Is Scale-Free=True
Iteration 9: Alpha=2.8398, Sigma=0.0582, Xmin=1.0, R=1172.7338, p=0.0000, Is Scale-Free=True
Iteration 10: Alpha=2.7668, Sigma=0.0559, Xmin=1.0, R=1218.6752, p=0.0000, Is Scale-Free=True
10 out of 10 networks were classified as scale-free.
Visualizing the network#
Okay, a possibly interesting thing to do – crucial for any type of data or model – is to visualize the graph itself. The next few sections develop functions for doing this.
Let’s start with a very basic approach.
import plotly.graph_objects as go
import networkx as nx
import random

# Generate a random graph for demonstration (or use your own graph 'G')
#G = nx.gnp_random_graph(1000, 0.004)

# Proportion of nodes to label
proportion = 0.1
# Calculate the number of nodes to label
num_nodes_to_label = int(proportion * len(G.nodes()))
# Randomly select nodes to label by first converting the nodes to a list
nodes_to_label = random.sample(list(G.nodes()), num_nodes_to_label)
# Add labels to the selected nodes
labels = {node: str(node) for node in nodes_to_label}

# Generate positions for each node using one of the layout functions
pos = nx.spring_layout(G)
# Scale the y-coordinates to stretch the graph vertically
for node, (x, y) in pos.items():
    pos[node] = (x, y * 2)  # Increase the factor to stretch more if needed

# Now plot the graph using Plotly
edge_x = []
edge_y = []
for edge in G.edges():
    x0, y0 = pos[edge[0]]
    x1, y1 = pos[edge[1]]
    edge_x.extend([x0, x1, None])
    edge_y.extend([y0, y1, None])

node_x = []
node_y = []
for node in pos:
    x, y = pos[node]
    node_x.append(x)
    node_y.append(y)

edge_trace = go.Scatter(
    x=edge_x, y=edge_y,
    line=dict(width=1, color='red'),  # Make sure the edge width and color are set to be visible
    hoverinfo='none',
    mode='lines')

node_trace = go.Scatter(
    x=node_x, y=node_y,
    mode='markers+text',
    text=[labels.get(node, '') for node in G.nodes()],
    textposition="top center",
    hoverinfo='text',
    hovertext=[str(node) for node in G.nodes()],  # Set hover text for each node
    marker=dict(
        showscale=True,
        colorscale='YlGnBu',
        reversescale=True,
        color=[],
        size=3,
        colorbar=dict(
            thickness=15,
            title='Node Connections',
            xanchor='left',
            titleside='right'
        ),
        line_width=2))

fig = go.Figure(data=[edge_trace, node_trace],
                layout=go.Layout(
                    title='<br>Network graph made with NetworkX',
                    titlefont_size=16,
                    showlegend=False,
                    hovermode='closest',
                    margin=dict(b=20, l=5, r=5, t=40),
                    annotations=[],
                    xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                    yaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                    # Set the figure size here
                    width=1000,   # Width of the figure in pixels
                    height=800,   # Height of the figure in pixels, adjust as necessary
                ))
fig.show()
The graph above shows a large ‘giant component’ in the center. Within the GC, you can get to any other node in the GC by traversing links. We also see some small connected subgraphs that are unconnected to the GC. A random subset of 10% of nodes are labeled. You can see the label for any node by hovering your mouse over it.
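We can back up the picture with numbers by listing the connected-component sizes directly. A small sketch on a sparse random graph (the same call works on our word graph `G`):

```python
import networkx as nx

# Sparse random graphs with mean degree > 1 typically show one giant
# component plus small fragments, matching the plot above.
G_demo = nx.gnp_random_graph(1000, 0.004, seed=1)
sizes = sorted((len(c) for c in nx.connected_components(G_demo)), reverse=True)
print(f"{len(sizes)} components; largest: {sizes[0]}; next few: {sizes[1:6]}")
```

With mean degree around 4, the giant component should absorb the large majority of nodes, leaving a scattering of tiny islands.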
But this is a rather inflexible bit of code. Let’s make a function with more flexibility.
A better visualizing function#
The function below allows you to specify giant=True if you only want to visualize the GC, or node_of_interest='sell' (or whatever word you want) if you just want to see the subgraph for one word and its neighbors. Run the code and then the examples in the next few code blocks.
import plotly.graph_objects as go
import networkx as nx
import random

def visualize_graph(G, proportion_to_label=0.05, gwidth=800, gheight=800, nsize=10, ncol='black', ecol='red', giant=False, node_of_interest=None):
    # First, calculate the degrees using the full graph
    full_graph_degrees = dict(G.degree())
    if giant:
        # Extract the largest connected component as a subgraph
        largest_cc = max(nx.connected_components(G), key=len)
        G = G.subgraph(largest_cc).copy()
        title = '<br>Largest Component in Network Graph'
    # Else if a specific node is provided, create a subgraph with the node and its neighbors
    elif node_of_interest is not None:
        # Check that the node of interest exists in the graph
        if node_of_interest not in G:
            print(f"Node of interest '{node_of_interest}' not found in the graph.")
            return  # Stop the function if the node of interest is not found
        neighbors = list(G.neighbors(node_of_interest))
        neighbors.append(node_of_interest)  # Include the node_of_interest in the subgraph
        G = G.subgraph(neighbors).copy()
        # Change the title to include the node label, its degree, and clustering coefficient
        degree = G.degree(node_of_interest)
        clustering_coefficient = nx.clustering(G, node_of_interest)
        title = f"<br>Node '{node_of_interest}': Degree {degree}, Clustering Coefficient {clustering_coefficient:.2f}"
    else:
        title = '<br>Network Graph'
    # Calculate the number of nodes to label
    num_nodes_to_label = int(proportion_to_label * len(G.nodes()))
    # Randomly select nodes to label by first converting the nodes to a list
    nodes_to_label = random.sample(list(G.nodes()), num_nodes_to_label)
    # Add labels to the selected nodes
    labels = {node: str(node) for node in nodes_to_label}
    # Generate positions for each node using a layout function
    pos = nx.spring_layout(G)
    # Scale the y-coordinates to stretch the graph vertically
    for node, (x, y) in pos.items():
        pos[node] = (x, y * 2)  # Increase the factor to stretch more if needed
    # Create edge traces
    edge_x = []
    edge_y = []
    for edge in G.edges():
        x0, y0 = pos[edge[0]]
        x1, y1 = pos[edge[1]]
        edge_x.extend([x0, x1, None])
        edge_y.extend([y0, y1, None])
    # Create node traces
    node_x = []
    node_y = []
    node_color = []  # List to hold the colors based on node degree
    hover_text = []  # List to hold hover text for each node
    for node in pos:
        x, y = pos[node]
        node_x.append(x)
        node_y.append(y)
        node_color.append(G.degree(node))  # Append the degree of each node
        # Hover text shows subgraph degree : full-graph degree
        hover_text.append(f"{node} {G.degree(node)}:{full_graph_degrees[node]}")
    edge_trace = go.Scatter(
        x=edge_x, y=edge_y,
        line=dict(width=1, color=ecol),
        hoverinfo='none',
        mode='lines')
    node_trace = go.Scatter(
        x=node_x, y=node_y,
        mode='markers+text',
        text=[labels.get(node, '') for node in G.nodes()],
        textposition="top center",
        hoverinfo='text',
        hovertext=hover_text,  # Set hover text for each node
        marker=dict(
            showscale=True,
            colorscale='YlGnBu',
            reversescale=True,
            color=node_color,  # Use node degrees for colors
            size=nsize,
            colorbar=dict(
                thickness=15,
                title='Node Degree',
                xanchor='left',
                titleside='right'
            ),
            line_width=2))
    # Define the figure with the new or default title
    fig = go.Figure(data=[edge_trace, node_trace],
                    layout=go.Layout(
                        title=title,
                        titlefont_size=16,
                        showlegend=False,
                        hovermode='closest',
                        margin=dict(b=20, l=5, r=5, t=40),
                        annotations=[],
                        xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                        yaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                        width=gwidth,
                        height=gheight
                    ))
    # Display the figure
    fig.show()

# Use the function with a sample graph
#GG = nx.gnp_random_graph(30, 0.05)
# Use either giant or node_of_interest, not both
#visualize_graph(G, proportion_to_label=0.1, gwidth=1000, giant=True)  # Only the largest component
#visualize_graph(GG, proportion_to_label=0.1, gwidth=1000, node_of_interest='tree')  # Only the node 'tree' and its neighbors
visualize_graph(G, proportion_to_label=.01, gwidth=2000, gheight=1000, nsize=5, ecol='cyan')  # Full graph; made it wider and taller using gwidth and gheight; changed edge color with ecol and node size with nsize
visualize_graph(G, proportion_to_label=0.5, gwidth=2000, nsize=5, giant=True) # Only the largest component because giant=True; proportion_to_label=0.5 labels half the nodes; hover to see the labels and degrees
visualize_graph(G, proportion_to_label=1, gwidth=1000, node_of_interest='best', nsize=10) # Only best and its neighbors, make nsize larger, and label every node (proportion_to_label=1)
visualize_graph(G, proportion_to_label=1, gwidth=1000, node_of_interest='zest', nsize=10) # Only zest and its neighbors
3D visualization#
Great! Now let’s make an interactive 3D version. You can click and drag with your mouse to rotate the network, and you can zoom in and out.
import plotly.graph_objects as go
import networkx as nx
import random
def visualize_graph_3d(G, proportion_to_label=0.05, gwidth=800, gheight=800, nsize=3, ncol='black', ecol='red', ewidth=3, giant=False, node_of_interest=None):
    # First, calculate the degrees using the full graph
    full_graph_degrees = dict(G.degree())
    if giant:
        # Extract the largest connected component as a subgraph
        largest_cc = max(nx.connected_components(G), key=len)
        G = G.subgraph(largest_cc).copy()
        title = '<br>Largest Component in Network Graph'
    elif node_of_interest is not None:
        # Check if the node of interest is in the graph
        if node_of_interest not in G:
            print(f"Node of interest '{node_of_interest}' not found in the graph.")
            return  # Stop the function if the node of interest is not found
        # Create a subgraph with the node and its neighbors
        neighbors = list(G.neighbors(node_of_interest))
        neighbors.append(node_of_interest)
        G = G.subgraph(neighbors).copy()
        degree = G.degree(node_of_interest)
        clustering_coefficient = nx.clustering(G, node_of_interest)
        title = f"<br>Node '{node_of_interest}': Degree {degree}, Clustering Coefficient {clustering_coefficient:.2f}"
    else:
        title = '<br>Network Graph'
    # Calculate the number of nodes to label
    num_nodes_to_label = int(proportion_to_label * len(G.nodes()))
    # Randomly select nodes to label by first converting the nodes to a list
    nodes_to_label = random.sample(list(G.nodes()), num_nodes_to_label)
    # Generate 3D positions for each node using a layout function
    pos = nx.spring_layout(G, dim=3)
    # Create edge traces
    edge_x = []
    edge_y = []
    edge_z = []
    for edge in G.edges():
        x0, y0, z0 = pos[edge[0]]
        x1, y1, z1 = pos[edge[1]]
        edge_x.extend([x0, x1, None])
        edge_y.extend([y0, y1, None])
        edge_z.extend([z0, z1, None])
    # Create node traces
    node_x = []
    node_y = []
    node_z = []
    node_color = []
    hover_text = []
    text_labels = {}
    for node in pos:
        x, y, z = pos[node]
        node_x.append(x)
        node_y.append(y)
        node_z.append(z)
        node_color.append(full_graph_degrees[node])  # Use degrees from the full graph for color
        # Construct hover text string for each node
        hover_text.append(f"{node} {G.degree(node)}:{full_graph_degrees[node]}")
        # Only add label text if the node is in the nodes_to_label list
        text_labels[node] = str(node) if node in nodes_to_label else ''
    edge_trace = go.Scatter3d(
        x=edge_x, y=edge_y, z=edge_z,
        line=dict(width=ewidth, color=ecol),
        hoverinfo='none',
        mode='lines')
    node_trace = go.Scatter3d(
        x=node_x, y=node_y, z=node_z,
        mode='markers+text',
        text=[text_labels.get(node, '') for node in G.nodes()],  # Access labels using .get method
        textposition="top center",
        hoverinfo='text',
        hovertext=hover_text,
        marker=dict(
            showscale=True,
            colorscale='YlGnBu',
            reversescale=True,
            color=node_color,
            size=nsize,
            colorbar=dict(
                thickness=15,
                title='Node Degree',
                xanchor='left',
                titleside='right'
            ),
            line_width=2))
    # Define the figure with the new or default title
    fig = go.Figure(data=[edge_trace, node_trace],
                    layout=go.Layout(
                        title=title,
                        titlefont_size=16,
                        showlegend=False,
                        hovermode='closest',
                        margin=dict(b=20, l=5, r=5, t=40),
                        annotations=[],
                        scene=dict(
                            xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                            yaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                            zaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                        ),
                        width=gwidth,
                        height=gheight
                    ))
    # Display the figure
    fig.show()
# Use the function with a sample graph
#GG = nx.gnp_random_graph(30, 0.05)
# Use either giant or node_of_interest, not both
visualize_graph_3d(G, proportion_to_label=0.1, gwidth=2000, gheight=1000) # Full graph
visualize_graph_3d(G, proportion_to_label=0.1, gwidth=2000, gheight=1000, giant=True, nsize=5) # Only the largest component
visualize_graph_3d(G, proportion_to_label=1, node_of_interest='best', nsize=10, ewidth=3, gwidth=1000) # Only best and its neighbors
visualize_graph_3d(G, proportion_to_label=1, node_of_interest='zest', nsize=10) # Only zest and its neighbors
import networkx as nx
from itertools import combinations
def find_triangles_and_ratio(G):
    triples_count = 0
    triangles_count = 0
    triangles_seen = set()
    for node in G.nodes():
        neighbors = list(G.neighbors(node))
        pairs = list(combinations(neighbors, 2))
        # Count the number of triples centered on the current node
        triples_count += len(pairs)
        for n1, n2 in pairs:
            if G.has_edge(n1, n2):
                # Use a frozenset so each triangle is counted only once
                triangle = frozenset((node, n1, n2))
                if triangle not in triangles_seen:
                    triangles_seen.add(triangle)
                    triangles_count += 1
    # The ratio of triangles to triples is the number of triangles divided by the number of triples
    ratio = triangles_count / triples_count if triples_count > 0 else 0
    return triples_count, triangles_count, ratio

# Example usage:
# G = nx.erdos_renyi_graph(n=10, p=0.5)
triples, triangles, ratio = find_triangles_and_ratio(G)
print(f"Number of triples: {triples}")
print(f"Number of triangles: {triangles}")
print(f"Ratio of triangles to triples: {ratio}")
Number of triples: 35811
Number of triangles: 4596
Ratio of triangles to triples: 0.12834045405043143
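As a cross-check on `find_triangles_and_ratio`: networkx’s built-in `nx.transitivity` computes 3 × triangles / triples (each triangle closes three triples, one centered at each of its vertices), so it should come out to exactly three times our ratio. A sketch on a graph with known counts:

```python
import networkx as nx

# K4 has 4 triangles and 12 connected triples (3 centered on each node),
# so the triangles-to-triples ratio is 4/12 = 1/3 and transitivity is 1.0.
K4 = nx.complete_graph(4)
n_triangles = sum(nx.triangles(K4).values()) // 3  # each triangle is counted at 3 nodes
n_triples = sum(d * (d - 1) // 2 for _, d in K4.degree())
print(n_triangles, n_triples, nx.transitivity(K4))  # 4 12 1.0
```

On our word graph, 3 × 0.1283 ≈ 0.385 would be the transitivity; note this differs from the average clustering coefficient (0.344 above), which weights every node equally rather than every triple.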
Doing something practical: let’s use network science to help us understand human word recognition#
We are going to read in lexical decision times from a database called ‘the English Lexicon Project’, or ELP.
import pandas as pd

# Data for something like 30k words from the ELP
elp_path = "./elp_reduced.csv"
# Read the CSV file
elp = pd.read_csv(elp_path, na_filter=False)
elp.head()

# Now let's read in the original lemmalex data again
# 17,750 words from the LexFindR R package
file_path = "./lemmalex.csv"
# Read the CSV file
alldata = pd.read_csv(file_path, na_filter=False)

# Merge the dataframes on the 'Item' column where entries exist in both dataframes
elp_lemmalex = pd.merge(elp, alldata, on='Item', how='inner')
print(f'Starting with {len(alldata)} entries in lemmalex and {len(elp)} entries in ELP. Merged dataframe has {len(elp_lemmalex)} entries. ')
elp_lemmalex.head()
It’s surprising that we only have 1259 matches, but that’s still enough items to do something interesting with. We have recognition time (RT) data for each of these words. Let’s see how RT compares to network properties. We will first do an analyze_graph report on the full lemmalex – this may take several minutes.
# analyze lemmalex
# first we have to create the graph; we are going to overwrite G to just keep things simple for now
print(f'Starting network creation with {len(alldata)} entries from {file_path}')
# Create an empty undirected graph
G = nx.Graph()
# Add words from the 'Item' column as nodes to the graph
G.add_nodes_from(alldata['Item'])
# Create a dictionary to group words by their length
words_by_length = {}
excluded_items_count = 0
for idx, word in enumerate(alldata['Item']):
    if pd.notnull(word):  # This will check if the word is not NaN or None
        str_word = str(word)
        words_by_length.setdefault(len(str_word), []).append(str_word)
    else:
        # Print the entire row that contains the bad data
        print(f"Excluded row at index {idx}:\n{alldata.iloc[idx]}")
        excluded_items_count += 1
# Add edges between words of the same length that differ by only one letter
start_time = time.time()
processed_items_count = 0
for length, words in words_by_length.items():
    for i, word1 in enumerate(words):
        for word2 in words[i+1:]:  # Only compare each pair once
            if substitution_neighbors(word1, word2):
                G.add_edge(word1, word2)
        processed_items_count += 1
        if processed_items_count % 250 == 0:
            elapsed_time = time.time() - start_time
            seconds_per_word = elapsed_time / processed_items_count
            words_left = len(alldata['Item']) - processed_items_count
            seconds_to_go = words_left * seconds_per_word
            print(f"{processed_items_count} items processed. Average time per word: {seconds_per_word:.4f} seconds. Estimated time to go: {seconds_to_go:.2f} seconds.")
total_time = time.time() - start_time
print(f'---> Created network in {total_time:.2f} seconds')
print(f'Excluded {excluded_items_count} items due to NaN or None')
# Display basic information about the updated graph
num_nodes = G.number_of_nodes()
num_edges = G.number_of_edges()
print(f'{num_nodes} nodes with {num_edges} edges')
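For reference, the substitution-neighbor test used above can be sketched as follows. This is a plausible reimplementation based on how it is described (same length, exactly one differing letter), not necessarily the exact helper defined earlier in the notebook:

```python
def substitution_neighbors(word1, word2):
    """True when word1 and word2 are substitution neighbors:
    same length and differing at exactly one letter position.
    (A sketch; the notebook's actual helper may differ in detail.)"""
    if len(word1) != len(word2):
        return False
    # Count positions where the letters disagree
    differences = sum(a != b for a, b in zip(word1, word2))
    return differences == 1

print(substitution_neighbors('cat', 'cot'))   # True: differ only at position 1
print(substitution_neighbors('cat', 'dog'))   # False: differ at all 3 positions
print(substitution_neighbors('cat', 'cats'))  # False: different lengths
```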
# now get the report
# this part may take quite a while; it takes 9 minutes on my very fast mac
metrics_df, node_df = analyze_graph(G, compare_random=True, return_node_df=True)
# Let's see the table
metrics_df
print(elp_lemmalex.head())
print(node_df.head())
# Now let's merge the nodes_df (which has node-specific stats on degree, CC, etc.) with the elp_lemmalex dataframe and see how many records we still have -- should be the same as before
# first let's rename 'Node' in node_df to 'Item' for easier matching
node_df = node_df.rename(columns={'Node': 'Item'})
# Merging the dataframes on 'Item' column where entries exist in both dataframes
elp_ll_net = pd.merge(elp_lemmalex, node_df, on='Item', how='inner')
print(f'Started with {len(alldata)} entries in lemmalex and {len(elp)} entries in ELP. Merged dataframe had {len(elp_lemmalex)} entries. New merged dataframe has {len(elp_ll_net)} entries.')
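As a reminder of how `how='inner'` behaves, here is a toy example with made-up mini-frames (the column names mirror ours; the values are hypothetical):

```python
import pandas as pd

# Toy frames: 'dog' appears only on the left, 'eel' only on the right
left = pd.DataFrame({'Item': ['cat', 'dog', 'bat'], 'I_Mean_RT': [600.0, 625.0, 590.0]})
right = pd.DataFrame({'Item': ['cat', 'bat', 'eel'], 'Degree': [5, 3, 1]})

# how='inner' keeps only rows whose Item appears in BOTH frames
merged = pd.merge(left, right, on='Item', how='inner')
print(merged)  # two rows: cat and bat; dog and eel are dropped
```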
# save the data
# elp_ll_net.to_csv('elp_ll_net.csv', index=False)
from scipy.stats import pearsonr
import numpy as np
# We'll now create a new variable in our DataFrame for the log-transformed Frequency
elp_ll_net['Log_Frequency'] = np.log(elp_ll_net['Frequency'])
# Reduce the dataframe to only include the variables of interest
vars_to_use = ['Item', 'I_Mean_RT', 'Log_Frequency', 'Degree', 'Clustering Coefficient', 'Betweenness Centrality', 'Closeness Centrality', 'Eigenvector Centrality']
elp_ll_net_reduced = elp_ll_net[vars_to_use].copy()  # .copy() avoids a SettingWithCopyWarning when we modify this frame later
elp_ll_net_reduced.to_csv('elp_ll_net_reduced.csv', index=False)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Read the dataset
#elp_ll_net_reduced = pd.read_csv('./elp_ll_net_reduced.csv')
# Display the first few rows
elp_ll_net_reduced.head()
# Sorting the I_Mean_RT values numerically
sorted_mean_rt = elp_ll_net_reduced['I_Mean_RT'].sort_values()
# Displaying the sorted values
sorted_mean_rt
# Checking for NaNs in all columns
nan_info = elp_ll_net_reduced.isna().sum()
nan_info
elp_ll_net_reduced.head()
Examine the data#
Let’s use a scatterplot matrix like we did with our pseudodata a few homeworks ago.
# run this block; it gives an error, but it prevents a later error. I do not understand why.
elp_ll_net_reduced['I_Mean_RT'] = pd.to_numeric(elp_ll_net_reduced['I_Mean_RT'])
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from scipy.stats import pearsonr
# print(elp_ll_net_reduced['I_Mean_RT'])
# elp_ll_net_reduced['I_Mean_RT'] = pd.to_numeric(elp_ll_net_reduced['I_Mean_RT'])
# print(elp_ll_net_reduced.head())
# print(elp_ll_net_reduced.dtypes)
# Define the correlation function with font size proportional to the correlation value
def corrfunc(x, y, **kws):
    # x, y = x.dropna(), y.dropna()  # Drop NaNs
    # if len(x) > 0 and len(y) > 0:  # Only compute correlation if there are non-NaN data
    r, _ = pearsonr(x, y)  # Compute the Pearson correlation coefficient
    ax = plt.gca()
    # Set font size proportional to the absolute value of the correlation coefficient
    fontsize = max(10, min(20, abs(r) * 40))
    ax.annotate(f"R = {r:.2f}", xy=(.1, .9), xycoords=ax.transAxes,
                fontweight="bold", fontsize=fontsize)
# Create the scatterplot matrix
sns.set(style="white")
pair_grid = sns.PairGrid(elp_ll_net_reduced, height=2.5)
pair_grid = pair_grid.map_upper(sns.regplot, scatter_kws={'s':1})
pair_grid = pair_grid.map_upper(corrfunc)
pair_grid = pair_grid.map_lower(sns.regplot, scatter_kws={'s':1})
pair_grid = pair_grid.map_diag(plt.hist)
# Show the plot
plt.show()
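The font-size rule in corrfunc maps |r| ∈ [0, 1] onto annotation sizes clamped to [10, 20]. That mapping can be checked in isolation, with Pearson's r computed from its definition in pure Python (no plotting needed; the data here are made up):

```python
import math

def pearson_r(x, y):
    """Pearson correlation from the definition (a sketch of the first
    value that scipy.stats.pearsonr returns)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def annotation_fontsize(r):
    # Same clamping rule as corrfunc: proportional to |r|, bounded to [10, 20]
    return max(10, min(20, abs(r) * 40))

r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])  # perfectly correlated -> r = 1.0
print(r, annotation_fontsize(r))           # 1.0 20
```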
Multiple regression to examine relationships#
Quick multiple regression…
print(elp_ll_net_reduced.head())
print(elp_ll_net_reduced.dtypes)
import statsmodels.api as sm
# Define the independent variables (all except 'I_Mean_RT')
independent_vars = elp_ll_net_reduced.columns.difference(['I_Mean_RT', 'Item'])
X = elp_ll_net_reduced[independent_vars]
# Add a constant term to the predictors
X = sm.add_constant(X)
# Define the dependent variable
y = elp_ll_net_reduced['I_Mean_RT']
# Fit the regression model
model = sm.OLS(y, X).fit()
# Get the summary of the regression
model_summary = model.summary()
model_summary
Briefly, we see that our model explains about 40% of the variance. Among the individual predictors, P>|t| is below 0.05 only for Log_Frequency and Degree; none of the other predictors help. Oh well.
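Under the hood, sm.OLS is ordinary least squares. On noise-free synthetic data (hypothetical predictors, not our ELP variables), the same coefficients fall out of np.linalg.lstsq, which is a useful sanity check on what the model is computing:

```python
import numpy as np

# Synthetic data generated from y = 5 + 2*x1 - 3*x2 (no noise)
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
y = 5 + 2 * x1 - 3 * x2

# Design matrix with a constant column, as sm.add_constant builds it
X = np.column_stack([np.ones_like(x1), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # recovers approximately [5., 2., -3.]
```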
Lab report#
Explain what it means for a network to be a small world network.
Choose a few more words to visualize. See if you can find any more fully-connected (CC=1) examples. Can you find any examples with unusual-seeming structure? Include screenshots of your graphs.
Try a different word length. Go back to the early code block where the data is read in. Re-run those code blocks, and change the length constraints (e.g., instead of > 3 and < 5, set to > 2 and < 4 for length of 3). If you call that graph something other than ‘G’ (e.g., G3) you will have an easier time comparing it to the 4-letter graph. Alternatively, copy and save the table summarizing the 4-letter graph. Compare and contrast the two graphs. Are there any salient differences?
Also do the small-world and scale-free tests. How different is your graph from the 4-letter graph?
Compare the analyze_graph reports for the 4-letter graph and the full lexicon graph. What kinds of differences do you see? What do you think about these differences?
Challenge questions: Grad students and honors students do at least one of these. To learn more, do more than 1!#
Programming challenge: Make a new version of the analyze_graph function that allows an option for the user to pass 2 graphs, and then print a comparison of those.
Programming challenge: Find the substitution_neighbors function. Can you create addition_neighbors and deletion_neighbors functions? Then create a graph based on DAS neighbors (two nodes are neighbors if they differ by a single Deletion, Addition, or Substitution; currently the code only checks substitution neighbors).
Analysis challenge: With the complete, 17,750-word lexicon analyzed in the last part above, run the plots and analyses again (using only substitution neighbors initially). Note that some analyses will take a long time – several minutes in some cases. Plotting the whole network might take a very long time. Report what kinds of things you see when you plot the whole graph, and what happens if you try to plot the giant component. Is there a giant component? What can you figure out about it?
Analysis x programming challenge: If you did the DAS-neighbors programming challenge above, try analyzing the whole network using DAS neighbors. What do you learn?
Analysis x programming challenge: Can you find a way to get a positive effect of clustering coefficient on the RT data? Try to define levels of high and low frequency, high and low degree (neighborhood), and high and low CC. The high and low ranges within a metric must not overlap. Can you find at least 20 items per cell (8 cells – 2 x 2 x 2) that give you good separation on high and low levels? If so, can you limit the data to those items and then re-run the multiple regression? (Or run an ANOVA if you know what that means.) What results do you get?
Analysis challenge: Scale and center predictors before doing regression. Consider whether RT or any other variables should be transformed (e.g., with log).